ABSTRACT

With growing data volumes from synoptic surveys, astronomers necessarily 
must become more abstracted from the discovery and introspection processes. 
Given the scarcity of follow-up resources, there is a particularly sharp onus on 
the frameworks that replace these human roles to provide accurate and well- 
calibrated probabilistic classification catalogs. Such catalogs inform the subse- 
quent follow-up, allowing consumers to optimize the selection of specific sources 
for further study and permitting rigorous treatment of purities and efficiencies 
for population studies. Here, we describe a process to produce a probabilistic 
classification catalog of variability with machine learning from a multi-epoch pho- 
tometric survey. In addition to producing accurate classifications, we show how 
to estimate calibrated class probabilities, and motivate the importance of prob- 
ability calibration. We also introduce a methodology for feature-based anomaly 
detection, which allows discovery of objects in the survey that do not fit within 
the predefined class taxonomy. Finally, we apply these methods to sources ob- 
served by the All Sky Automated Survey (ASAS), and unveil the Machine-learned 
ASAS Classification Catalog (MACC), which is a 28-class probabilistic classifi- 
cation catalog of 50,124 ASAS sources. We estimate that MACC achieves a 
sub-20% classification error rate, and demonstrate that the class posterior prob- 
abilities are reasonably calibrated. MACC classifications compare favorably to 
the classifications of several previous domain-specific ASAS papers and to the 
ASAS Catalog of Variable Stars, which had classified only 24% of those sources 
into one of 12 science classes. The MACC is publicly available on-line. 

Subject headings: methods: data analysis - methods: statistical - stars: vari- 
ables: general - techniques: photometric - catalogs 

1. Introduction 

Synoptic imaging surveys have begun to routinely collect dozens to thousands of epochs 
of photometric data over wide swaths of the sky. The manifest destiny for optical time- 
domain studies is the Large Synoptic Survey Telescope (LSST; Tyson 2002), which will 
collect time histories for O{10^) stars and explosive transients. With data collected for so 
many sources, no longer is it possible for experts to manually scrutinize significant subsets 
of the data. In this era of wide-field time-domain surveys, accurate multi-class source cat- 
alogs, which are created automatically by machine-learning (ML) algorithms, are required 
to maximize the scientific output from these projects (Eyer et al. 2008; Borne et al. 2009). 
Furthermore, these catalogs must be probabilistic in nature, with well-calibrated posterior 
class probabilities. This enables each scientist to use a personalized threshold for selecting 
objects for follow-up, where science class probabilities fit naturally within a framework for 
optimizing the allocation of limited resources, and to select objects for population studies, 
where a rigorous treatment requires detailed understanding of the purities and efficiencies of 
the sample. 

Creating probabilistic multi-class catalogs for large-scale time-domain photometric sur- 
veys is a difficult task. First and foremost, a set of salient class-predictive features^ needs 
to be estimated for each source. From unevenly sampled light curves which contain sea- 
sonal gaps, varying levels of noise, and occasional spurious flux measurements, estimating 
the periods and amplitudes of oscillation for each source is not trivial. Furthermore, devising 
light-curve features that can separate specific sub-classes of sources requires deep domain 
knowledge. Next, classification models must be constructed to map the light-curve feature 
vector for each source to a set of posterior class probabilities. These classifiers need to be 
able to automatically learn multiple class boundaries in high-dimensional feature spaces from 
a set of training data with known classes and, for each source, return a calibrated posterior 
class probability for each science class. This endeavor is complicated by the fact that the set 



^We define a feature to be a deterministic real-valued or categorical metric based on the time-series 
input or spatial location of the source. See §3. 
of training data is typically not representative of the objects in the survey, which can cause 
large sample-selection biases (see Richards et al. 2012) in the posterior class probability es- 
timates. Additionally, the sources observed by the survey are not guaranteed to fit neatly 
into any of the pre-defined classes, necessitating anomaly detection to identify which sources 
are likely to belong to an undefined science class. 

Several aspects of the catalog effort have required focused research attention. In Richards 
et al. (2011), we introduced an end-to-end framework for machine-learned classification of 
variable stars, with advancements in periodic and non-periodic light-curve feature estimation 
as well as probabilistic, non-parametric classification methodology. In terms of classification 
error rate, our methods showed significant improvement over the previous state-of-the-art 
(Debosscher et al. 2007) on a well-studied data set. Indeed, other groups have also converged 
onto a similar set of tools as the best current light-curve classification methodology for vari- 
able stars (e.g., Dubath et al. 2011). In Richards et al. (2012), we introduced a methodology 
to overcome the debilitating effects of non-representative training sets on variable star classi- 
fication, and in Long et al. (2012) we devised methods to appropriately use light curve data 
from older surveys to classify periodic variable stars in new surveys. With these advances, 
the accuracy of variable star classification is improving demonstrably, with cross-validated 
error rates approaching 15-20% on multi-class problems with different data sets (Dubath 
et al. 2011; Richards et al. 2011). 

In this paper, we build on these recent advancements in the photometric classification 
of variability by focusing on the problem of how to properly construct a variable star clas- 
sification catalog from a photometric survey. Accurate classification of each source in the 
survey remains the primary concern of this endeavor. However, there are several other issues 
that arise when generating classification catalogs for use in astrophysical studies. First, a 
classification catalog requires good calibration for the posterior class probability estimates, 
P(class | survey data). Good calibration means that of all the objects for which we estimate 
a posterior class probability, p, of belonging to a certain science class, a proportion p of them 
truly belong to that class. In this paper, we describe a method for calibrating classifier 
probabilities and outline how such information can and should be used when employing such 
a probabilistic classification catalog for downstream astrophysical inference. 
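
The calibration criterion above can be checked empirically. The following is a minimal sketch, assuming a set of sources with known class membership; the function name, the number of bins, and the toy data are illustrative assumptions and do not represent the calibration procedure used for the catalog. It bins the predicted probabilities and compares the mean prediction in each bin with the observed fraction of true class members.

```python
def reliability_table(probs, is_member, n_bins=5):
    """Bin predicted class probabilities and compare each bin's mean
    prediction with the observed fraction of true class members."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, is_member):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_p = sum(p for p, _ in members) / len(members)
        frac = sum(y for _, y in members) / len(members)
        table.append((mean_p, frac, len(members)))
    return table


# Hypothetical toy data: 10 sources with p = 0.2 (2 true members) and
# 10 sources with p = 0.8 (8 true members), i.e. a well-calibrated case.
probs = [0.2] * 10 + [0.8] * 10
truth = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
table = reliability_table(probs, truth)
```

For well-calibrated probabilities, the first two entries of each row (mean predicted probability and observed fraction) should agree to within sampling noise.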

Second, when constructing a classification catalog for a large number of objects, anoma- 
lies will certainly be present. When building a supervised classification model, these anoma- 
lies are typically not accounted for, resulting in a classification schema which attempts to 
cram each object into a predefined classification taxonomy. In this paper, we describe the 
use of a semi-supervised anomaly detection routine which allows us to leverage our classifier 
to determine which sources do not resemble any of the training data and likely belong to 
a variability class not populated by the training set. We determine, for each source in the 
catalog, a real-valued measure of the degree of deviance of that source from the training 
data. 

Third, in a photometric survey, where a majority of objects will fall near the detection 
limit, the prevalence of aliased periods (at integer-valued cycles per day for a ground-based 
survey) will be debilitating for an automated classifier. We outline, in this paper, a pre- 
scription for detecting, and dealing with, sources with aliased periods. We also detail how 
to cross-match sources with external catalogs to obtain further classification features (e.g., 
color) and use a method to impute the values of those attributes when no match is detected. 

Finally, we use this methodology to create a calibrated probabilistic classification cat- 
alog for a set of 50,124 sources in the All Sky Automated Survey (ASAS; Pojmański 1997) 
based on their publicly available ASAS V-band light curves and colors. Our Machine-learned 
ASAS Classification Catalog (MACC) contains, for each source, posterior probabilities for 
28 different science classes. This is a wealth of new information compared to the existing 
ASAS Catalog of Variable Stars (ACVS; Pojmański 2002), which classified a subset of these 
sources into 12 science classes without supplying any posterior class probabilities and giving 
the uninformative class label 'MISC' to a majority of objects. In addition to probabilistic 
classifications, MACC gives an anomaly score for each ASAS source, which describes its 
proximity to objects in the training set. Furthermore, our catalog provides updated periods, 
peak-to-peak amplitudes, and dozens of other estimated features for each ASAS light curve. 
We ensure that all steps in the MACC catalog creation are transparent and provide a public 
interface to the catalog at www.bigmacc.info. 

The paper is structured as follows. In §2 we describe the ASAS catalog, our retrieval and 
pre-processing of the photometric data, and how we cross-match with 2MASS for infrared 
colors and impute the values of those colors when no match is made. We describe the 
classification methodology used in the construction of the catalog in §3, including the 
definition of all light-curve features, description of the period estimation and aliased-period 
treatment procedures, and the derivation of our labeled training set. In §3.3 we describe 
the procedure to calibrate the classifier posterior probability estimates and in §3.4 we detail 
the computation of a semi-supervised anomaly score for each source. The MACC catalog is 
introduced in §4, where we describe the attributes of the catalog and how to access it. We 
compare the classifications of MACC to some other ASAS classifications in the literature in 
§5 and finally we conclude with a few remarks about future directions in §6. 
2. Data 

2.1. ASAS Data Collection 

The All Sky Automated Survey^ is an ongoing, long-term project dedicated to the 
detection and monitoring of the photometric variability of bright stars (Pojmański 1997). 
Since August 2000, ASAS has monitored bright stars (V < 14 mag) in the entire available 
sky south of δ = +28° from Las Campanas Observatory. ASAS uses two small wide-field 
telescopes to monitor the sky with V- and I-band filters. Each ASAS telescope takes repeated 
180-second exposures using a 2K × 2K CCD camera with 15-μm pixels, covering 8.5 × 8.5 
deg^2 of the sky (see Pojmański 1997 for further details). 

To date, ASAS has taken more than 267,260 V-band frames, imaging approximately 17 
million stars of V-band magnitude between 8 and 14. Of these 17 million objects, ASAS has 
identified 50,124 variable stars and published the results in the ASAS Catalog of Variable 
Stars (Pojmański 2002, 2003; Pojmański & Maciejewski 2004, 2005; Pojmański et al. 2005). 
The catalog, which contains a rough classification for each source, is made publicly available 
through the ASAS website, along with V-band light curves for 15 million ASAS sources. 
For the 50,124 variable stars, ASAS has retrieved a median of 541 usable epochs of V-band 
measurements. The ACVS light curves range in the number of good detection epochs from 
3 to 2232. 



2.2. ASAS Photometric Light Curves 

We retrieved the ASAS ACVS dataset by first referencing the ACVS.1.1 catalog, which 
contains 50,124 variable stars, and individually retrieving the data for each source from the 
ACVS website^. These sources were imported into our DotAstro.org (http://dotastro.org) 
astronomical light-curve warehouse for visualization and use with internal frameworks 
(Brewer et al. 2009). Each ASAS source's time-series data file is partitioned by its observed 
field and includes information on the quality of the aperture photometry for each epoch, as 
well as magnitude measurements (and uncertainties) from up to five different apertures. Prior 
to importing the data we chose a single aperture for each epoch using the method detailed 
below and excluded epochs with a quality GRADE=D or quality GRADE=C when MAG=29.999, 
which, as detailed in the ASAS data files, corresponds to a non-detection. Given the 



^http://www.astrouw.edu.pl/asas/ 

^Available at http://www.astrouw.edu.pl/asas/?page=acvs. 
undue influence of extreme photometric outliers in light curve feature estimation, before 
the generation of time-series based features, we applied sigma clipping to each ASAS light 
curve, excluding observations that lie beyond 4 standard deviations from each source's mean 
magnitude. 
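
The clipping step can be sketched as follows. This is a minimal illustration that assumes a single clipping pass and a population standard deviation; the text does not state whether the clipping was iterated, and the function name is hypothetical.

```python
import statistics


def sigma_clip(mags, n_sigma=4.0):
    """Keep magnitudes within n_sigma standard deviations of the mean.

    Assumption: one pass only, using the population standard deviation.
    """
    mean = statistics.fmean(mags)
    std = statistics.pstdev(mags)
    if std == 0.0:
        return list(mags)  # constant light curve: nothing to clip
    return [m for m in mags if abs(m - mean) <= n_sigma * std]


# Hypothetical light curve: 60 well-behaved epochs plus one spurious measurement.
light_curve = [10.0] * 30 + [10.1] * 30 + [15.0]
clipped = sigma_clip(light_curve)
```

Note that the outlier itself inflates the mean and standard deviation, so a single pass is conservative for short light curves.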

ASAS provides five aperture measurements using annuli ranging from 2 pixels (30") to 6 
pixels (90"). Although the ASAS team outlined a basic algorithm for choosing which aperture 
to use for each source given its average magnitude (Pojmański et al. 2005), we decided to 
use our own magnitude-dependent aperture cuts. Our procedure is the following: we begin 
by determining the aperture which has the minimum magnitude variance for a source^. The 
idea behind using minimum magnitude variance is that apertures that are too small will not 
capture all of the source's flux, resulting in larger Poisson noise in the measured brightness 
of the object, whereas apertures that are too large will incur more background noise and 
noise due to contamination from neighboring sources. For each aperture, we can visualize 
the distribution of mean magnitudes for the sources whose minimal magnitude dispersion 
occurred in that aperture (see Figure 1). This information was subsequently employed 
to construct a simple kernel density estimation (KDE) classifier to determine the optimal 
aperture to use for each source, as a function of its mean magnitude. Using this classifier we 
determine the optimal aperture for each ASAS source, as a function of its average magnitude 
(more precisely, the median magnitude of its five aperture-wise average magnitudes), and 
only import the light curve measurements from that aperture. The optimal magnitude cuts 
for each aperture are over-plotted in Figure 1. 
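
The two stages described above (picking the minimum-variance aperture per source, then learning a magnitude-dependent rule from those picks) might be sketched as below. The Gaussian kernel, the bandwidth of 0.3 mag, the weighting of each aperture's density by its share of sources, and all function names are assumptions made for illustration; the text does not specify these details.

```python
import math
import statistics


def min_variance_aperture(mags_by_aperture):
    """Index of the aperture whose magnitudes have the smallest variance."""
    variances = [statistics.pvariance(m) for m in mags_by_aperture]
    return variances.index(min(variances))


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def kde_aperture_classifier(mean_mags, best_apertures, bandwidth=0.3):
    """Build a rule mapping mean magnitude -> aperture index.

    Each aperture's density is weighted by the fraction of training
    sources whose minimum-variance aperture it was (an assumption).
    """
    by_aperture = {}
    for mag, ap in zip(mean_mags, best_apertures):
        by_aperture.setdefault(ap, []).append(mag)
    total = len(mean_mags)
    models = {ap: (len(v) / total, gaussian_kde(v, bandwidth))
              for ap, v in by_aperture.items()}

    def classify(mag):
        return max(models, key=lambda ap: models[ap][0] * models[ap][1](mag))

    return classify
```

In this sketch the magnitude cuts between apertures are implicit: they fall wherever the weighted densities of two neighboring apertures cross.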



2.3. Querying the Naval Observatory Merged Astrometric Dataset 

In addition to information gleaned from single-band light curves, color information is 
invaluable to classifying variable stars. To generate color features, we use the Naval Obser- 
vatory Merged Astrometric Dataset (NOMAD; Zacharias et al. 2004) to obtain single-epoch 
B, V, R, J, H, and K_s-band photometry for each ASAS object, which we use to compute 5 
color features (B - V, V - R, R - J, J - H, H - K_s) for each source. Although the ASAS 
ACVS catalog provides cross-correlated 2MASS colors, the additional optical filters provided 
by NOMAD supply a richer set of colors to aid the classifier. Due to the large ASAS posi- 
tional errors, we decided against using simple spatial cross-correlation to match each ASAS 
source to the NOMAD catalog. Instead, we train a ML classifier which takes as input 7 
positional and photometric features to determine whether a NOMAD candidate is indeed a 



^Within the field with the greatest number of observations for that source. 
Fig. 1. Aperture-wise histograms of the average magnitude of ASAS sources whose minimal 
magnitude dispersion was observed to be in the specified aperture. As expected, brighter 
sources experience smaller dispersion when observed in wider apertures and fainter sources 
show smaller dispersion in narrower apertures. Using these histograms, we construct a kernel 
density estimation classifier to determine the optimal aperture to use for each ASAS source as 
a function of its average magnitude of brightness. The magnitude cuts from this procedure 
are overlaid in vertical lines and the total number of ASAS objects extracted with each 
aperture are listed in the figure. 
match to the ASAS star. In addition to the separation distance between the ASAS source 
and NOMAD candidate, we employ the NOMAD nearest neighbor rank (ordered by dis- 
tance from the ASAS source), magnitude differences in the J, H, and K_s bands and the J - K_s 
color difference between ACVS and NOMAD, and the V-band difference between the ASAS 
light curve mean magnitude and NOMAD to allow a richer view of each source which will 
facilitate the ASAS-NOMAD matching procedure. 

This ASAS-NOMAD association classifier was initially trained using 48 ASAS sources 
of known class, sampled from 24 science classes, with 2 sources from each science class taken 
from the literature. For each of these training objects, we manually determined which source, 
from a NOMAD catalog query around the position of the ASAS source, was truly associated with 
that object. These sources were classified as 'match', while all other sources returned by the 
NOMAD query were classified 'non-match'. 

Using the 7 positional and photometric features, we initially trained a random forest 
(RF, see Breiman 2001) classifier on the 48-object training set and applied the classifier to 
predict match/non-match for a sample of 30,000 of the ~500,000 NOMAD sources which 
are retrieved when the NOMAD catalog is queried around each of the 50,124 ASAS sources 
in our data set. Using the active learning technique of Richards et al. (2012), we selected 
17 NOMAD sources which would have high impact in improving the performance of the 
classifier, and manually classified each as a 'match' or 'non-match', and subsequently added 
these objects to the training set. This active learning process was performed over 10 it- 
erations, each adding 17 sources, resulting in a robust classifier which can accurately and 
automatically decide whether a NOMAD source is associated with an ASAS source based 
on the positional and photometric features. 

Ultimately, the classification algorithm was applied to each ASAS source to find the 
matching NOMAD entry, if any. For each ASAS star, we find the NOMAD source with 
the highest classifier probability of 'match', with preference for spatially closer matches when 
identical probabilities are returned for multiple NOMAD sources. If, for an ASAS object, no 
NOMAD source achieves 'match' probability > 50%, then we decide that no NOMAD source 
exists for that object. When applied to all 50,124 ASAS sources, we find that 93.9% of these 
sources match a NOMAD source. 
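
The selection rule in this paragraph can be written compactly. The sketch below assumes each candidate is represented as a tuple of identifier, classifier 'match' probability, and separation; that representation and the function name are hypothetical.

```python
def select_match(candidates, threshold=0.5):
    """Pick the NOMAD candidate associated with one ASAS source.

    candidates: list of (nomad_id, match_probability, separation_arcsec).
    Returns the id of the chosen candidate, or None when no candidate
    has a match probability above the threshold.
    """
    if not candidates:
        return None
    # Highest probability first; equal probabilities favor the closer source.
    best = min(candidates, key=lambda c: (-c[1], c[2]))
    return best[0] if best[1] > threshold else None
```

For example, `select_match([("a", 0.9, 5.0), ("b", 0.9, 2.0)])` returns `"b"`, because the two probabilities tie and "b" is spatially closer.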

For the 47,044 objects with a NOMAD match, we extract the 5 NOMAD color features 
for use in the variable star classifier. For the remaining 3080 objects with no NOMAD match, 
we impute their colors using the MissForest imputation routine of Stekhoven & Bühlmann 
(2012). MissForest^ is an imputation routine that uses a series of random forests to pre- 
dict the value of each missing feature based on the observed features for that source. The 
MissForest algorithm builds a random forest regression model (for real-valued features) or 
classifier (for categorical features) to predict the value of each feature from all of the other 
features. Beginning from some initialization of the missing features, the algorithm iterates 
until convergence is attained and outputs the predicted value for each missing feature in the 
data matrix. On multiple data sets, Stekhoven & Bühlmann (2012) show that MissForest 
outperforms other common methods, such as K-nearest neighbors and Lasso, in imputation 
accuracy. We employ MissForest using 100 trees. 

^The R package missForest is freely available at http://cran.r-project.org/web/packages/missForest. 

We test the accuracy of MissForest in imputing variable star colors by the following 
experiment. Starting with the set of 47,044 objects with a satisfactory NOMAD match, we 
null out the colors for a random 6.1% of the objects (the same fraction of ASAS objects with 
no NOMAD match). Then, using the leftover set of sources with known colors, we impute 
the nulled out colors using MissForest. This allows us to compare the true colors to the 
imputed colors for this subset of data, which we do using median absolute error (MAE), 

    \sigma(x_{j,\mathrm{imp}}) = \mathrm{median}_i \left| x_{ij,\mathrm{true}} - x_{ij,\mathrm{imp}} \right|    (1)

where $x_{ij,\mathrm{true}}$ and $x_{ij,\mathrm{imp}}$ denote the true and imputed values, respectively, of color j for 
object i. MAEs, $\sigma$, for each of the 5 colors in our data are reported in Table 1. While 
the MAEs for each color, particularly the optical-NIR colors, are larger than the typical 
uncertainty of the observed color for any individual source, we note that a large scatter is 
to be expected because we are imputing the observed color without reddening corrections. 
Indeed, an examination of the observed color for each class shows that the typical within- 
class scatter is > 2 mag, most likely owing to the various galactic latitudes at which the 
ASAS sources are observed. The imputation procedure confidently identifies stars as being 
either red or blue, and the obtained accuracy of these imputations is similar to the typical 
scatter in the observed colors, which gives us confidence that the procedure is sufficient for 
classification purposes. 
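
The bookkeeping of this validation experiment is simple enough to sketch. The code below only selects the objects whose colors are nulled out and scores the imputed values with the median absolute error; the imputation itself (MissForest) is not reproduced here, and the function names and the random seed are assumptions.

```python
import random
import statistics


def holdout_indices(n_objects, fraction, seed=0):
    """Randomly choose the objects whose known colors are nulled out."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_objects), round(n_objects * fraction)))


def median_absolute_error(true_vals, imputed_vals):
    """Median over objects of |x_true - x_imp| for a single color."""
    return statistics.median(abs(t - m) for t, m in zip(true_vals, imputed_vals))


# Hypothetical values of one color for five held-out objects.
true_color = [0.5, 1.0, 1.5, 2.0, 0.2]
imputed_color = [0.6, 0.7, 1.5, 2.4, 0.2]
mae = median_absolute_error(true_color, imputed_color)
```

The median is less sensitive than the mean to a small number of badly imputed objects, which is presumably why it is the statistic reported.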



3. ASAS Variable Star Classifier 

Probabilistic supervised light curve classification has recently received much attention 
in the literature. For example, Debosscher et al. (2007), Dubath et al. (2011), and Richards 
et al. (2011) have applied modern machine learning methods to ~25-class variable star 



problems using photometric light curve data from the Hipparcos and OGLE surveys. This 
automated classification methodology consists of the following two-step process: 

1. From each light curve, a set of m features (e.g., period, amplitude, etc.) is extracted. 
These features are constructed to capture the class-predictive information encoded 
within each light curve. 

2. Using a training set of objects of known class, a classification model, which maps from 
m-dimensional feature space to the set of classes, is fit. Methods such as neural nets, 
decision trees, support vector machines and random forests are classification models 
that have been used for light curve classification. The fitted classification model serves 
as a class prediction engine. 

Once the classifier has been trained, it is trivial to predict the class of each variable star, 
which entails first extracting the feature vector of the object and subsequently inserting that 
vector into the classifier to obtain a prediction. Many classifiers, such as random forest, 
produce a vector of posterior class probabilities for each object. 

To construct the ASAS variable star classification catalog, we use a set of m = 71 
features: 66 light curve features and 5 colors (described in §2.3). See §3.1 below for a 
description of the features used. We use a random forest classifier, which has been shown 
to attain high levels of accuracy in variable star classification by Dubath et al. (2011), 
and Richards et al. (2011). Richards et al. (2011) found that the random forest classifier 
attained the lowest error rates in classifying Hipparcos and OGLE variable stars in a side- 
by-side comparison with a dozen other classification models. In §3.2 we describe how to 
attain a training set for ASAS to minimize classification errors due to sample selection bias 

Table 1. Color imputation median absolute errors using the missForest imputation method. 
(see Richards et al. 2012 for a thorough discussion of sample selection bias for light curve 
classification). 

3.1. Light Curve Feature Extraction 

Raw light curve data consist of measurements of a source's brightness over unevenly 
sampled epochs. From these data, our challenge is to estimate a set of features that are 
predictive of each source's class (e.g., it is well known that period, amplitude, and color are 
all highly predictive of class for certain classes of pulsating variable star) while being agnostic 
to other latent factors which are unrelated to (or at most, mildly correlated with) an object's 
science class. Examples of such latent factors are that each ASAS light curve consists of a 
different number of epochs (ranging from 3 to 2232 epochs with median of 541), over a 
different time baseline, with distinct noise properties and differing cadences. Furthermore, 
each ASAS source has a unique mean brightness (from 4th to 15th magnitude in V), resides 
in a unique position in the sky, and has its light affected by more or less intervening dust. 

We have constructed a set of 66 light-curve features meant to capture the essence of 
photometric variability of the science classes of interest, and have written algorithms that 
efficiently compute these features from light curve data, in an average of 4.5 seconds per 
ASAS light curve. In Richards et al. (2011), a set of 52 features was used to represent each 
variable star. Below, we describe the additional features that have been used in this study, 
and also outline some modifications to the algorithms used for periodic modeling. 

3.1.1. Computationally Efficient Regularized Fitting of Periodic Signals of Arbitrary Shape 

In this study, we employ a novel fitting routine which seeks to simultaneously discover 
the true period of a source while also modeling the light curve in detail. 

We begin by applying our fast Lomb-Scargle algorithm (fit of single sinusoid; Richards 
et al. 2011) to discover all marginally significant periods for a given light curve on a broad 
frequency test grid ($\nu_{\min} = 1/T$, $\nu_{\max} = 10$, $\delta\nu = 0.1/T$ cycle/day, where $T$ is the data 
timespan). For test frequencies where the power spectrum has a value > 6 (i.e., <1% of test 
points, corresponding to roughly 3.5$\sigma$ significance), we fit a multi-harmonic model, 

    m_i = c\,t_i + \sum_{n=1}^{8} A_n \sin(2\pi n \nu_0 t_i) + B_n \cos(2\pi n \nu_0 t_i) + b_{n,0}    (2)

consisting of a sinusoid at the initial frequency, $\nu_0$, plus sinusoids at each of the $n = 2, \ldots, 8$ 
harmonics of that initial frequency. We choose a maximum of $n = 8$ to allow for sufficient model complexity 
to account for the light curves under study. The fitting of Equation (2) is performed with a 
regularization penalty to avoid over-fitting, and the number of effective model degrees of 
freedom is typically well below the allowed value of 2 × 8 = 16. 
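
As a concrete illustration of the frequency grid and the multi-harmonic model, consider the following sketch. The function names and argument layout are hypothetical, and a single constant offset is assumed for the additive term at the end of the model.

```python
import math


def frequency_grid(times, nu_max=10.0, step_fraction=0.1):
    """Test frequencies (cycles/day) from 1/T to nu_max in steps of step_fraction/T."""
    T = max(times) - min(times)  # data timespan in days
    nu_min = 1.0 / T
    step = step_fraction / T
    n = int(math.floor((nu_max - nu_min) / step)) + 1
    return [nu_min + k * step for k in range(n)]


def harmonic_model(t, nu0, trend, offset, A, B):
    """Evaluate the multi-harmonic model at time t.

    A and B hold the sine and cosine coefficients of harmonics n = 1..len(A);
    `offset` is a single constant term (an assumption of this sketch).
    """
    total = trend * t + offset
    for n, (a, b) in enumerate(zip(A, B), start=1):
        phase = 2.0 * math.pi * n * nu0 * t
        total += a * math.sin(phase) + b * math.cos(phase)
    return total


grid = frequency_grid([0.0, 50.0, 100.0])
```

With a 100-day timespan the grid starts at 0.01 cycle/day and advances in steps of 0.001 cycle/day, giving roughly ten thousand test frequencies up to 10 cycles/day.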

In the fitting, we minimize 

    R = \sum_{i=1}^{N} \frac{(d_i - m_i)^2}{\sigma_i^2} + N \lambda \sum_{n=1}^{8} n^4 \left( A_n^2 + B_n^2 \right),    (3)

with respect to the model parameters $\theta$ and the regularization parameter $\lambda$. Here, the 
photometric data are $d_i$ with uncertainties $\sigma_i$, the model is $m_i$, $N$ is the number of data points, and $\sqrt{A_n^2 + B_n^2}$ 
is the amplitude of the $n$th Fourier harmonic. The second term above effectively penalizes 
the model in proportion to the magnitude of its second derivative. Small values of $\lambda$ result 
in models with high-frequency structure, whereas large $\lambda$ values yield smoother, more slowly 
changing models. For fixed $\lambda$, the best-fit parameters can be found by least squares. We 
identify the optimal value of $\lambda$ using generalized cross validation (Golub et al. 1979; Craven 
& Wahba 1979). This allows the data to drive the complexity of the model while also 
constraining the model to not over-fit the data. 
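
A penalized fit of this kind, with the regularization parameter chosen by generalized cross validation, can be sketched with NumPy as follows. This is not the authors' implementation. It assumes a chi-square data term weighted by the photometric uncertainties, a penalty weight of n^4 on the squared amplitude of the nth harmonic (consistent with penalizing the second derivative), no penalty on the linear trend or constant offset, and a hypothetical grid of trial values for the regularization parameter.

```python
import numpy as np


def design_matrix(t, nu0, n_harm=8):
    """Columns: linear trend, constant offset, then sin/cos for each harmonic."""
    cols = [t, np.ones_like(t)]
    for n in range(1, n_harm + 1):
        cols.append(np.sin(2 * np.pi * n * nu0 * t))
        cols.append(np.cos(2 * np.pi * n * nu0 * t))
    return np.column_stack(cols)


def fit_regularized(t, d, sigma, nu0, lambdas, n_harm=8):
    """Penalized weighted least squares for each trial lambda.

    Returns (gcv, lambda, theta, dof) for the lambda with the lowest GCV score.
    """
    N = len(t)
    X = design_matrix(t, nu0, n_harm)
    w = 1.0 / sigma**2
    XtWX = X.T @ (X * w[:, None])
    XtWd = X.T @ (w * d)
    # Penalty weights: 0 for trend and offset, n^4 for both A_n and B_n.
    pen = np.concatenate([[0.0, 0.0], np.repeat(np.arange(1, n_harm + 1) ** 4, 2)])
    best = None
    for lam in lambdas:
        M = XtWX + N * lam * np.diag(pen)
        theta = np.linalg.solve(M, XtWd)
        dof = np.trace(np.linalg.solve(M, XtWX))  # effective degrees of freedom
        rss = np.sum(w * (d - X @ theta) ** 2)
        gcv = N * rss / (N - dof) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, theta, dof)
    return best


# Synthetic, unevenly sampled sinusoid (hypothetical data for illustration).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 200))
sigma = np.full(200, 0.02)
d = 10.0 + 0.5 * np.sin(2 * np.pi * 0.37 * t) + rng.normal(0.0, 0.02, 200)
gcv, lam, theta, dof = fit_regularized(t, d, sigma, 0.37, np.logspace(-6, 0, 13))
amp1 = np.hypot(theta[2], theta[3])  # amplitude of the first harmonic
```

Because the penalty grows as n^4, a moderate regularization parameter suppresses the high harmonics, which here fit only noise, while leaving the fundamental nearly untouched; the effective degrees of freedom reported by the fit reflect that shrinkage.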



3.1.2. Novel Light-Curve Features 

In addition to the 32 periodic and 20 non-periodic features used in Richards et al. 
(2011) to parametrize variable stars, we add 15 new features based on our generalized 
Lomb-Scargle periodogram, of which ten were also used by Long et al. (2012). These fea- 
tures are compiled in Table 2. The first two features are freq_amplitude_ratio_21 and 
freq_amplitude_ratio_31, which are ratios of the amplitudes of the second to first and 
third to first frequencies, respectively. We also add three features aimed at detecting eclips- 
ing sources from the Lomb-Scargle model in Equation (2), phased on twice the Lomb-Scargle 
period. We compute the phases and magnitudes of the two distinct minima and two distinct 
maxima of the phased light curve model. The feature freq_model_max_delta_mags is the 
absolute value of the magnitude difference between the two model light curve magnitude 
maxima (i.e., eclipses), and should be non-zero if the source is an eclipsing binary. Similarly, 
the feature freq_model_min_delta_mags captures the absolute value of the magnitude dif- 
ference between the two magnitude minima, and the feature freq_model_phi1_phi2, which 
is constructed to detect eccentric binary systems, is the ratio of the phase difference between 
the first minimum and the first maximum (i.e., primary eclipse) to the phase difference 
between the first minimum and second maximum (i.e., secondary eclipse). 
Additionally, we introduce the feature freq_n_alias, which counts the number of fre- 
quency estimates that are consistent with a 1-day alias (see §3.1.4 for details of this pro- 
cedure). This feature supplements the freq_signif feature to determine whether a source 
is, in fact, periodic. A source with estimated period at a 1-day alias often has a large 
freq_signif value even when the light curve is truly aperiodic, in which case it will be 
identifiable as aperiodic by a non-zero value of freq_n_alias. We further add the class- 
specific feature freq_rrd, which indicates whether any of the frequency ratios are consistent 
with 0.746, which is the frequency ratio exhibited by double-mode RR Lyrae variable stars. 
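
To make these two features concrete, the sketch below counts frequencies lying near integer cycles per day and tests frequency pairs against the 0.746 ratio. The tolerances (0.01 cycle/day and 0.005 in ratio) and the function bodies are assumptions for illustration only; the actual alias criterion is the one described in §3.1.4.

```python
def is_one_day_alias(freq, tol=0.01):
    """True if freq (cycles/day) lies within tol of a positive integer."""
    nearest = round(freq)
    return nearest >= 1 and abs(freq - nearest) <= tol


def freq_n_alias(freqs, tol=0.01):
    """Number of frequency estimates consistent with a 1-day alias."""
    return sum(is_one_day_alias(f, tol) for f in freqs)


def freq_rrd(freqs, target=0.746, tol=0.005):
    """1 if any pair of (positive) frequencies has a ratio near target, else 0."""
    for i, f1 in enumerate(freqs):
        for f2 in freqs[i + 1:]:
            ratio = min(f1, f2) / max(f1, f2)
            if abs(ratio - target) <= tol:
                return 1
    return 0
```

For example, `freq_n_alias([1.003, 2.0, 0.37])` counts two aliased frequencies, and `freq_rrd([2.0, 2.681, 5.1])` flags the pair 2.0 and 2.681, whose ratio is close to 0.746.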

Finally, we add the following 5 features which are adopted from Dubath et al. (2011). 
The feature scatter_res_raw computes the ratio of the median absolute deviation (MAD) 
of the residuals of the Lomb-Scargle model to the MAD of the raw light curve. The fea- 
tures p2p_scatter_2praw, p2p_scatter_over_mad, and p2p_scatter_pfold_over_mad 
are the sum of squared differences of the scatter about the light curve phased on the Lomb- 
Scargle period to that of either the phased or raw light curve data. Similarly, the fea- 
ture medperc90_2p_p is the 90th percentile of the absolute residual values around the model 
phased on twice the Lomb-Scargle period divided by the same quantity for the residuals 
around the model phased on the Lomb-Scargle period. Furthermore, we develop two new 
features, fold2P_slope_10percentile and fold2P_slope_90percentile, which are the 
10th and 90th percentile slopes of the Lomb-Scargle model around twice the period, in- 
tended to capture the steepness of the ingress and egress of eclipse. Lastly, we add the 
feature p2p_ssqr_diff_over_var from Kim et al. (2011), which is the sum of squared mag- 
nitude differences in successive measurements divided by the variance. 
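
Two of these features have definitions that are fully specified in the text and are short enough to write out. The sketch below is a direct reading of those definitions; the use of the population variance and the function signatures are assumptions.

```python
import statistics


def mad(values):
    """Median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def scatter_res_raw(mags, model_mags):
    """MAD of the residuals about the model divided by the MAD of the raw magnitudes."""
    residuals = [m - f for m, f in zip(mags, model_mags)]
    return mad(residuals) / mad(mags)


def p2p_ssqr_diff_over_var(mags):
    """Sum of squared successive magnitude differences divided by the variance.

    Assumption: the population variance is used, and mags are in time order.
    """
    ssq = sum((b - a) ** 2 for a, b in zip(mags, mags[1:]))
    return ssq / statistics.pvariance(mags)
```

A smoothly varying light curve gives a small value of `p2p_ssqr_diff_over_var`, since successive measurements differ by much less than the overall spread, whereas uncorrelated noise gives a value that grows with the number of epochs.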



3.1.3. Correcting Eclipsing Periods 

Comparison of our estimated periods with those from the ACVS catalog reveals that our 
period estimates are often exactly half of the ACVS period for sources which are classified 
as eclipsing binaries by ACVS. Of the 5913 objects that are classified as eclipsing binaries 
in ACVS, our period estimate matches the ACVS period for only 1339 sources (23%) and 
was exactly one-half of the ACVS period for 4162 sources (70%). After visual inspection of 
some of these light curves, we find that for eclipsing binaries in which our periods differ, the 
ACVS period is correct for most (but not all) of the objects. Using a visually confirmed set 
of 150 eclipsing sources in which our period is exactly one-half of the true period and 150 
eclipsing binaries for which our period is correct, we construct a supervised machine-learned
random forest classifier on all of the features described in §3.1 to automatically determine,
for each eclipsing source in the data set, whether our estimated period is correct or is
one-half of the true period.

In this classifier, the most important features in determining whether our period is
correct are, unsurprisingly, the freq_model_max_delta_mags, freq_model_min_delta_mags,
and freq_model_phi1_phi2 features, which capture differences between the primary and
secondary eclipses, and freq1_harmonics_amplitude_1, the amplitude of the first
harmonic, which will be large for an eclipsing binary with two unequal eclipse depths
that was incorrectly assigned a period of one-half of the true eclipsing period. We
apply this classifier to all 11,169 sources in our data set that were either classified by ACVS 
as an eclipsing binary or whose most probable class from our variable star classifier was one 
of the eclipsing binary classes. Of those sources, the classifier determined that our period 
was correct for 5807 objects and that our period was wrong by a factor of 1/2 for 5516 
sources. Doubling the period of those 5299 sources yielded a significant boost in the period 
agreement rate with the ACVS eclipsing binary stars, with 4225 of 5913 (71%) of those 
sources resulting in a period match. 

In Figure 2 we plot, for the 12,008 ASAS sources which the ACVS confidently classified
into a single periodic class (i.e., not classified as "MISC" and not listed in multiple classes), 
the ACVS period versus our estimated period. Our agreement rate with ACVS is 77.2% on 
these objects. Including matches to half and twice the ACVS period yields an agreement rate 
of 92.9%. For 14.1% of the ACVS periodic sources, our period finder estimates a best-fit 
period of exactly half the ACVS period and for 1.6% of the sources our period is double 
the ACVS period. Though some of these cases may be errors on our part, they are not 
debilitating to the variable star classifier because 1) eclipsing binaries are not constrained 
to inhabit a narrow period range, and 2) other eclipsing binary features are more useful in 
detecting binary systems and sub-classifying them into their physical class. 
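The agreement statistics quoted above can be tabulated with a small helper such as the one below. This is a sketch only: the relative tolerance used to decide that two periods "match" is a hypothetical value, since the matching criterion is not stated in the text.

```python
def period_relation(p_est, p_ref, rel_tol=1e-3):
    """Classify an estimated period relative to a reference (e.g. ACVS) period."""
    ratio = p_est / p_ref
    for label, target in (("match", 1.0), ("half", 0.5), ("double", 2.0)):
        if abs(ratio - target) <= rel_tol * target:
            return label
    return "other"

def agreement_rates(pairs, rel_tol=1e-3):
    """Fraction of (estimated, reference) pairs in each agreement category."""
    counts = {"match": 0, "half": 0, "double": 0, "other": 0}
    for p_est, p_ref in pairs:
        counts[period_relation(p_est, p_ref, rel_tol)] += 1
    return {label: n / len(pairs) for label, n in counts.items()}
```

Summing the "match", "half", and "double" fractions gives the combined agreement rate of the kind reported for Figure 2.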

3.1.4. Treating Aliased Periods

In a ground-based survey, aliases are common at 1-day periods due to the rotational 
period of the Earth. Large samples from photometric surveys are generally filled with
quasi-periodic and non-periodic sources and low S/N periodic light curves, making 1-day aliases
prominent. The prevalence of 1-day aliases in ASAS, as well as aliases at each integer number
of cycles per day, is easily seen in a plot of estimated period versus statistical significance of
the period (Figure 3). There is a clear division between aliased and non-aliased sources:
aliased objects tend to have smaller statistical significance, as exhibited by the over-density
of objects around each aliased period with significance < 10σ.
[Figure 2. Axis label: Period (ACVS), logarithmic scale from 1e-01 to 1e+03.]

Fig. 2. Period estimated by our period-finding algorithm versus the period stated in the
ACVS catalog, for all 12,008 ASAS periodic sources in ACVS. The red dashed line denotes
perfect agreement; for a total of 9280 of the stars (77.2%) we find periods that exactly match
the ACVS period. For 92.9% of these sources, our period estimate either matches the ACVS
period exactly or is different by a factor of two.






We use the period versus period-significance plane to determine whether each source's period
is an alias. At each period P = 1, 1/2, 1/3, 1/4 days we perform the following experiment.
First, we randomly sample 25 objects from a small window around the period, P. Using
the ALLSTARS web-based visualization tool^ we decide whether each object's period is truly
aliased. Next, we sample additional objects whose determination of alias versus non-alias is
not clear from the initial sample; this is achieved by fitting a 5-nearest neighbor classifier to
the initial set of 25 objects and only selecting objects whose 5 nearest neighbors disagree in
a 3-to-2 ratio. These objects are again verified manually and the process is repeated until the
phase space is appropriately filled in. Finally, we find that a function of the form

    s_P(x_i) = [expression not recoverable from the source text]    (4)

produces an acceptable separation boundary between the aliases and non-aliases around
each period P = 1, 1/2, 1/3, 1/4, where $x_i$ is the estimated period for object $i$ and
$s_P(x_i)$ is the significance below which the object is deemed to be an alias. Using the above
sample, we find the values of $(\gamma_{1,P}, \gamma_{2,P})$ that minimize the sum of the squared
residual distances for objects whose alias-ness is misclassified. In Figure 3, we plot the
estimated alias-decision boundary for 1-day aliases, along with the training sample used to
determine that boundary. This allows us to determine, for any object near an aliased period,
whether it is aliased, and enables us to select its next-best non-aliased period as the true
period and to compute the aforementioned freq_n_alias feature.
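The step that selects ambiguous objects for further manual inspection can be sketched as follows. This is an illustration under stated assumptions: each object is represented as a (period, significance) pair, distances are plain Euclidean distances in those coordinates (the text does not specify any rescaling), and the function names are hypothetical.

```python
import math

def knn_votes(point, labeled, k=5):
    """Return (n_alias, n_non_alias) among the k nearest labeled objects."""
    nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
    n_alias = sum(1 for _, is_alias in nearest if is_alias)
    return n_alias, len(nearest) - n_alias

def ambiguous_candidates(unlabeled, labeled, k=5):
    """Select unlabeled objects whose 5 nearest neighbours split 3-to-2."""
    selected = []
    for point in unlabeled:
        if sorted(knn_votes(point, labeled, k)) == [2, 3]:
            selected.append(point)
    return selected
```

Objects returned by ambiguous_candidates would then be inspected by eye, added to the labeled set, and the selection repeated.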



^Active Learning Lightcurve classification Service, see Richards et al. (2012) for more details.

Fig. 3. Top: Period versus period significance (in number of σ) for ACVS objects,
estimated from their ASAS V-band light curves. There are clear over-densities at 1-day, 1/2-day,
etc. periods (as denoted by the arrows). The aliased objects typically have small statistical
significance in the period. Bottom: Zooming into the region around 1-day periods, we see
a clear division between objects with aliased (red △) and non-aliased (blue ○) periods, as
verified by manual study. We find that a function of the form of Equation 4 separates the
aliased and non-aliased objects; we use this function to decide whether each source is aliased
at a 1-day period, and similarly for other aliases.



3.1.5. Feature Importance

For completeness, we plot the importance of each feature in the classification random
forest in Figure 4. The RF feature importance measure describes the decrease in overall
classification accuracy that would result if the feature were replaced by a random permutation
of its values. See Breiman (2001) for further details. In Figure 4 we find that the fundamental
frequency of oscillation (i.e., period) of the light curve is by far the most important feature in
the classifier. Other important features include estimates of the light curve skew, measurements
of amplitude/variability (stetson_j, std, median_absolute_deviation), various colors,
and features extracted from the light curve folded on twice the period. One caveat to the
feature importance measure is that it does not account for correlations between features. For
instance, the standard deviation and median absolute deviation of the light curve both
provide measurements of the spread in the flux measurements about the average value; thus, the
conditional importance of std given median_absolute_deviation is quite low even though
their individual importance measures are both large^.
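The idea behind the permutation importance measure can be sketched in a few lines. This is a simplified illustration, not the random forest implementation used here: Breiman's measure is computed tree by tree on out-of-bag samples, whereas the sketch below permutes each feature once over a single labeled sample for an arbitrary `predict` callable. All names are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop when each feature column is randomly permuted in turn.

    rows must be tuples of feature values.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by its permuted value.
        permuted = [row[:j] + (value,) + row[j + 1:]
                    for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(predict, permuted, labels))
    return importances
```

A feature the classifier never uses receives an importance of exactly zero. Two strongly correlated features can both score highly, which is the caveat noted above.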

3.2. Training the Classifier 

Non-parametric supervised classification methods, such as random forest, require a 
training set of data with known class label to learn the mapping from feature space to 
classes. Once this model is learned, data from each ASAS source can be trivially fed into 
the model to attain probabilistic classifications for each object. However, much care must be 
taken to attain a training set that is representative of the ASAS data. If significant discrep- 
ancies exist between the distribution of training features and the distribution of the features 
of the ASAS data, then, as shown by Richards et al. (2012), significant biases can occur 
in the ASAS classifications due to poor model selection and catastrophic errors caused by 
sample-selection bias. In this section, we detail the construction of our classification training 
set and efforts to avoid sample-selection bias. 

As the base training set for the ASAS classifier, we use the training set of confirmed
Hipparcos and OGLE sources used in Richards et al. (2011) (which is based on, but slightly
different from, the training set used by Debosscher et al. 2007). This data set consists of 1549
variable stars from 27 different science classes^. Next, we cross-match the Hipparcos training
set with our ASAS sample, finding 268 matching sources. For these 268 training objects, we
replace their Hipparcos light curves with their ASAS light curves in the training set. At this
stage of the analysis, we also choose to exclude four variable star classes: Lambda Bootis,
Slowly Pulsating B, Gamma Doradus, and Wolf-Rayet. Each of these classes of variable star
is populated by objects whose amplitude of variability is ΔV < 0.05 mag, which is below
the ACVS variability selection cut of 95th percentile in the magnitude-dispersion diagram
(Pojmański 2002). Indeed, of the 113 variable stars in our Hipparcos training set that belong
to one of these four classes, not a single star passed the variability cuts used to construct
the ACVS catalog, even though 78 of the 113 stars were observed by ASAS. Because such
prototypical examples of each of the four small-amplitude classes did not satisfy the cuts
used to construct ACVS, we do not expect to find any objects of these classes in the ACVS
sample.

^Dubath et al. (2011) account for this by iteratively removing features that are highly correlated with the
most important features. We choose not to perform this preselection because we find that some redundancy
in the features helps the performance of the random forest classifier, especially when using features that are
robust to outliers.

^Note that this training set is slightly different from that of Richards et al. (2011) in that we further
split the T Tauri class into Classical (9 stars) and Weak-line (2 stars) subclasses and add the SX Phoenicis
variable class.

Fig. 4. Random forest feature importance for the top 20 features, as estimated by
calculating the mean feature importance over 5 random forest classifiers. As expected, the
fundamental frequency of oscillation is the most important feature in ASAS variable star
classification. The next most important features include the skew of the flux measurements,
the Fourier model amplitude of the fundamental frequency, the B − J color, the ratio of the
standard deviation of the scatter about the Fourier model to the raw observed scatter, and
the Stetson variability index J (Stetson 1996). Error bars denote the standard deviation in
the feature importance over 5 random forests.

The feature distribution of this initial training set is substantially different from the bulk
distribution of ASAS features (see Figure 1 of Richards et al. 2012). Richards et al. (2012)
showed that this mismatch causes poor performance by supervised ML classification
and demonstrated that an active learning framework could be used to supplement the
training set in a statistically rigorous manner. Active learning is a classification paradigm in
which the supervised classifier is able to query the human user for the classification labels of
a subset of sources with unknown class, whereby these objects are manually labeled by the
user and added to the training set. Using a random forest classifier, the active learning query
function $S_2$ from equation (5) of Richards et al. (2012), and the crowdsourcing methodology
outlined in that work, we add 407 ASAS sources to the training set.

In addition to the 407 active-learning training sources, we supplemented the classification
training set with matched sources from the SIMBAD catalog (Wenger et al. 2000)
using a combination of algorithmic catalog matching, literature searching, and human
vetting. Starting with the list of NOMAD sources associated with ASAS sources (see §2.3), our
algorithm looks for a SIMBAD source which is spatially close to the NOMAD source, calling
a match any SIMBAD source which is within 0.5 arcseconds of the NOMAD source. If no
SIMBAD source fits this constraint, then no association is made. Our primary purpose for
this exercise was to strengthen the training set for under-represented science classes. Thus,
for any positive SIMBAD association of class RV Tauri, Population II Cepheid, Beta Cephei,
Chemically Peculiar, T Tauri, or Herbig Ae/Be, we performed a literature search on the
object, only including the source in the training set if it was definitely confirmed by multiple
sources. This procedure allowed us to add 68 sources to the training set. At this point, we
also added R Coronae Borealis (RCB), a well-studied class of hydrogen-deficient carbon-rich
supergiants that undergo episodes of extreme dimming (Clayton 1996), to the training
set, populating the training sample with 17 RCB stars found via the SIMBAD matching
procedure.
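The positional matching criterion can be illustrated with a short sketch. It assumes coordinates are given as (RA, Dec) in decimal degrees and uses the haversine formula for the angular separation. Returning the closest candidate when several fall within the radius is a design choice made for this example and is not described in the text.

```python
import math

def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation (haversine formula), returned in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(h))) * 3600.0

def match_within_radius(target, candidates, radius_arcsec=0.5):
    """Return the closest candidate within the radius, or None if none qualifies."""
    best, best_sep = None, radius_arcsec
    for cand in candidates:
        sep = angular_separation_arcsec(target[0], target[1], cand[0], cand[1])
        if sep <= best_sep:
            best, best_sep = cand, sep
    return best
```

When match_within_radius returns None, no association is made, mirroring the rule stated above.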

In a preliminary edition of the classification catalog it was noticed that an excessively
large fraction of the ACVS variables, > 10%, were being classified as T Tauri stars (TTS).
At the time TTS constituted only ~0.7% of the training set, so the large fraction of TTS
classifications was not expected. Upon further inspection we discovered that the inclusion of
the two sub-classes of TTS, which exhibit significantly different photometric behavior, into a
single class led to their significant overrepresentation in the final catalog. Thus, we decided
to split the TTS class into two classes: weak-line T Tauri stars (WTTS) and classical T Tauri
stars (CTTS). This split is physically motivated, as WTTS are older young stellar objects
whose photometric variability is periodic and characterized by the rotational modulation
of cool spots on the stellar surface; CTTS, on the other hand, are younger stars that are
still actively accreting from a disk, with a variability signature that is typically more chaotic
than that of WTTS (for a review of TTS variability see Herbst et al. 1994 and references therein).
To populate these two new classes we divided all members of the original TTS training
set as well as new TTS identified via our SIMBAD-ASAS matching query, which included
SIMBAD matches of type Y*O, Or*, pr*, or TT*^. We split these sources into the CTTS
and WTTS classes using the classical dividing line between the two: for CTTS the equivalent
width (EW) of Hα emission is > 10 Å, while for WTTS EW_Hα < 10 Å (see e.g., Walter 1986;
Strom et al. 1989). Stars were only included in the training set if we could find a published
value of EW_Hα, which typically came from the catalogs of Herbig & Bell (1988) or Torres
et al. (2006).

It was later noticed that several known members of the RS Canum Venaticorum (RS
CVn) class of binary stars were being classified as WTTS, which prompted us to add RS
CVn stars as a new class in the training set. To populate the RS Canum Venaticorum
class in the training set we identified matches between ACVS sources and the catalog of
chromospherically active binary stars (CABS; Strassmeier et al. 1988). The CABS includes
both RS CVn and BY Draconis (BY Dra) binaries, both of which we include in the training
set as the latter is the low-mass analog of the former. In practice RS CVn and BY Dra
stars exhibit the same photometric behavior; from a classification standpoint they can only
be separated spectroscopically, which is why we include them as a single class in the MACC.
The cross-match between the CABS and ACVS produces 16 RS CVn and 1 BY Dra, which
we use to define the RS CVn training set.

Our final training set consists of 1945 sources in 28 science classes. A total of 777 of these 
sources are observed by ASAS, so we use their ASAS light curves to derive features to train
the classifier on. For the other 1168 training objects, we only have data in Hipparcos (644 
stars) or OGLE (524 stars), so we employ the light curves observed by those missions. A 
tabulation of the entire training set, by class, is given in Table 3. The implicit class prior 
in fitting a random forest classifier is the empirical vector of training-set class proportions, 
which is given in Table 3. 

Finally, we find the optimal random forest model by minimizing the 10-fold cross-validation
classification error rate over the ASAS training set with respect to the number of random
forest trees, ntree, the number of features considered at each splitting node, mtry, and the
minimum size of each terminal node, nodesize. Performing a grid search over those three
parameters, we find that the optimal model is ntree = 5000, mtry = 17, and nodesize = 1,
attaining an average 10-fold cross-validation error rate of 19.15% for the 777 ASAS training
objects. For the remainder of this paper, and to construct the ASAS classification catalog,
we use this optimized classification model.

^Y*O: Young Stellar Object; Or*: Variable Star of Orion Type; pr*: Pre-main sequence Star; TT*: T
Tau-type Star.

3.3. Calibrating Classifier Probabilities 

Using the features described in §3.1 and the training set outlined in §3.2, we fit a random 
forest classifier with optimized tuning parameters and use it to generate class predictions 
and full 28-class probability vectors for all 50,124 ASAS objects. A desirable property of 
probabilistic classifications is that they be calibrated. That is to say, if we consider all sources 
whose class probabilities for a particular class are 90%, then 90% of those objects should 
truly be of that class. Calibration is attractive because it allows us to treat the probabilistic 
classifier output as if it were truly a set of posterior class probabilities, P(class | x).
Calibration also allows us to easily substitute different prior class probabilities by multiplying the
classification probabilities by the appropriate vector of prior ratios and re-normalizing the
probability vectors (see §4.1 for a detailed explanation).
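The prior substitution described above amounts to an element-wise rescaling followed by renormalization. A minimal sketch, with hypothetical function and argument names, is:

```python
def reweight_posterior(posterior, old_prior, new_prior):
    """Swap the class prior: scale by new/old prior ratios, then renormalize."""
    scaled = [p * (new / old)
              for p, old, new in zip(posterior, old_prior, new_prior)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

This step is only meaningful if the input probabilities are calibrated, which is why calibration is checked next.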

However, the class probabilities estimated by the RF are not necessarily calibrated. To
check their calibration we perform the following experiment. Using only the subset of ASAS
training data (777 objects), we perform 10-fold cross-validation to estimate the random forest
classification probabilities for each source^. This provides a vector of 28 cross-validated class
probabilities for each object. Then, in each of 8 disjoint probability bins (chosen such that
each bin contains at least 100 instances), we compute the proportion of the objects,
$p_{\mathrm{true}}$, that are truly of the specified class. If the probabilities were calibrated, then
the value of $p_{\mathrm{true}}$ should match the mean random forest probability within each bin.
In Figure 5 we see, by the solid black line, that this certainly is not the case for our classifier.
Specifically, the random forest classifier tends to be conservative in that it systematically
estimates a smaller probability than $p_{\mathrm{true}}$ for RF probabilities greater than ~0.3.
For instance, in the RF probability bin centered around 0.5, around 70% of those objects are
truly of the specified class.
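The binning experiment behind such a reliability diagram can be sketched as follows. The inputs are assumed to be flattened over all (object, class) pairs: `probs[k]` is the cross-validated probability assigned to a class, and `is_class[k]` records whether the object truly belongs to that class. The bin edges are supplied by the caller; how the 8 bins were chosen in practice is not reproduced here.

```python
def reliability_curve(probs, is_class, bin_edges):
    """Per-bin (mean predicted probability, empirical fraction, count)."""
    curve = []
    n_bins = len(bin_edges) - 1
    for b in range(n_bins):
        lo, hi = bin_edges[b], bin_edges[b + 1]
        last = b == n_bins - 1
        # Half-open bins [lo, hi), except the last bin, which includes hi.
        members = [(p, y) for p, y in zip(probs, is_class)
                   if lo <= p < hi or (last and p == hi)]
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        p_true = sum(1 for _, y in members if y) / len(members)
        curve.append((mean_p, p_true, len(members)))
    return curve
```

For a calibrated classifier the first two entries of each tuple agree to within sampling noise.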

^Cross-validation ensures that each object is held out of the training set when fitting the classifier that
is used to predict the class probabilities for that object. In this sense, the cross-validated classification
probabilities are representative of the classifier probabilities for the unlabeled data.

Table 3. Class distribution of training set objects used to fit the probabilistic ASAS
classifier. This class distribution defines the prior on class probabilities used to compute
posterior class probabilities for each source.

Science Class         N_Train    Prior P(Class)
Mira                    164        0.0843
Semireg PV              101        0.0519
SARG A                   15        0.0077
SARG B                   29        0.0149
LSP                      54        0.0278
RV Tauri                 25        0.0129
Classical Cepheid       204        0.1049
PopII Cepheid            27        0.0139
MultiMode Cepheid        98        0.0504
RR Lyrae FM             148        0.0761
RR Lyrae FO              39        0.0201
RR Lyrae DM              59        0.0303
Delta Scuti             133        0.0684
SX Phe                    6        0.0031
Beta Cephei              55        0.0283
Pulsating Be             49        0.0252
PerVarSG                 55        0.0283
ChemPeculiar             75        0.0386
RCB                      17        0.0087
ClassT Tauri             12        0.0062
Weakline T Tauri         20        0.0103
RS CVn                   17        0.0087
Herbig AEBE              22        0.0113
S Doradus                 7        0.0036
Ellipsoidal              13        0.0067
Beta Persei             178        0.0915
Beta Lyrae              202        0.1039
W Ursae Maj             121        0.0622

Two popular methods exist for calibrating classifier probabilities using simple
transformations. Platt Scaling (Platt 1999) transforms the probabilities using a sigmoid function
whose parameters are chosen via maximum likelihood over the training set. Isotonic
Regression (Robertson et al. 1988; Zadrozny & Elkan 2001) is more flexible, replacing the sigmoid
function with any monotonically increasing function (which is typically restricted to a set
of non-parametric isotonic functions, such as step-wise constants). A drawback to both of
these methods is that they assume a two-class problem; a straightforward way around this is
to treat the multi-class problem as C one-versus-all classification problems, where C is the
number of classes. However, we find that Platt Scaling is too restrictive a transformation
to reasonably calibrate our data and determine that we do not have enough training data in
each class to use Isotonic Regression with any degree of confidence.

Ultimately, we find that a calibration method similar to the one introduced by Bostrom
(2008) is the most effective for our data. This method uses the probability transformation

$$ \hat{p}_{ij} = \begin{cases} p_{ij} + r(1 - p_{ij}) & \text{if } p_{ij} = \max\{p_{i1}, p_{i2}, \ldots, p_{iC}\} \\ p_{ij}(1 - r) & \text{otherwise} \end{cases} \tag{5} $$

where $\{p_{i1}, p_{i2}, \ldots, p_{iC}\}$ is the vector of class probabilities for object $i$ and
$r \in [0, 1]$ is a scalar. Note that the adjusted probabilities,
$\{\hat{p}_{i1}, \hat{p}_{i2}, \ldots, \hat{p}_{iC}\}$, are proper probabilities in that they
are each between 0 and 1 and sum to unity for each object. The optimal value of $r$ is
found by minimizing the Brier score (Brier 1950) between the calibrated (cross-validated)
and true probabilities^. We find that using a fixed value for $r$ is too restrictive and, for
objects with small maximal RF probability, it enforces too wide a margin between the
first- and second-largest probabilities. Instead, we implement a procedure similar to that of
Bostrom (2008) and parameterize $r$ with a sigmoid function based on the classifier margin,
$\Delta_i = p_{i,\max} - p_{i,\mathrm{2nd}}$, for each source:

$$ r(\Delta_i) = \frac{1}{1 + e^{A \Delta_i + B}} \tag{6} $$

where the Brier score is minimized with respect to both $A$ and $B$. This parametrization
allows the amount of calibration adjustment to differ between objects with confident
(high-margin) and ambiguous (low-margin) classifications. Indeed, as expected, we find that the
proper amount of adjustment is low for stars with small RF margin (e.g., r(0.05) = 0.18) and
higher for sources with large maximal RF probability (e.g., r(0.9) = 0.95). The parameters
that minimize the Brier score over the training set are $A^* = -5.271$ and $B^* = 1.754$.

With the Bostrom (2008) calibration procedure, we correct the RF probability estimates
for all ASAS sources. To test the efficacy of our procedure, we plot, as the blue dashed line in
Figure 5, the adjusted (cross-validated) RF probabilities versus true posterior probabilities
for our set of 777 ASAS training set objects. The calibration is now substantially improved
over the raw random forest probabilities and the calibrated probabilities are consistent with
the true posterior class probabilities. Note that the adjusted probabilities are still slightly
conservative in that, on average, the estimated probabilities are systematically smaller than
the true probabilities for estimated probabilities greater than ~0.1. In Figure 6, we plot
these reliability diagrams for each of four subclasses of variable stars. Within each of the
four subclasses, the calibration has improved, with marked decrease in the Brier score for
each subclass.

^The Brier score is defined as $B(\hat{p}) = \sum_{i=1}^{N} \sum_{j=1}^{C} \left( I(y_i = j) - \hat{p}_{ij} \right)^2$, where $N$ is the total number of
objects, $C$ is the number of classes, and $I(y_i = j)$ is 1 if and only if the true class of source $i$ is $j$.
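The calibration transformation and the Brier score can be sketched compactly. The code below is illustrative only. It assumes that the non-maximal probabilities are shrunk by the factor (1 - r), which is the form that keeps each vector summing to unity, and that ties for the maximum are resolved by boosting the first maximal class. The default values of `a` and `b` are the fitted A* and B* quoted above; the optimization over A and B is not shown.

```python
import math

def r_of_margin(margin, a=-5.271, b=1.754):
    """Sigmoid adjustment strength as a function of the classifier margin."""
    return 1.0 / (1.0 + math.exp(a * margin + b))

def calibrate(probs, a=-5.271, b=1.754):
    """Boost the top class by r(1 - p) and shrink the others by (1 - r)."""
    ranked = sorted(probs, reverse=True)
    r = r_of_margin(ranked[0] - ranked[1], a, b)
    top = probs.index(ranked[0])  # first maximal class (assumed tie rule)
    return [p + r * (1.0 - p) if j == top else p * (1.0 - r)
            for j, p in enumerate(probs)]

def brier_score(prob_vectors, true_classes):
    """Sum over objects and classes of (indicator - probability) squared."""
    return sum((float(j == y) - p) ** 2
               for probs, y in zip(prob_vectors, true_classes)
               for j, p in enumerate(probs))
```

With the quoted parameters, r_of_margin(0.05) and r_of_margin(0.9) evaluate to roughly 0.18 and 0.95, matching the values given in the text.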



3.4. Detecting Anomalous Objects 

Our calibrated ASAS probabilistic classification catalog supplies, for each object, its 
posterior probability of belonging to each of 28 science classes given its observed ASAS light 
curve and colors. These posterior class probabilities assume prior class probabilities given by 
the distribution of object types in the training set (see Table 3). The posterior probabilities 
also assume that the training set is fully representative of the set of ASAS data, meaning 
that all classes present in the ASAS data are represented in the training data and that the 
distribution of ASAS features is the same as the training set feature distribution. However, 
there is no guarantee that these conditions will be satisfied for each ASAS object, even after 
performing several rounds of active learning to reduce the discrepancies between the training 
and ASAS data sets. 

The challenge, then, is to identify ASAS objects which do not resemble any of the 
training data. Classifier predictions for these objects will be dubious due to the outlying 
nature of their feature vectors compared to the training set feature distribution, either due 
to their belonging to a class not included in the training set or anomalous features brought 
about by noise or atypical physical variability. To detect such anomalies, we compute, for 
each ASAS object, a distance metric from that object's feature vector to each source in the 
training set. In contrast to previous methods which compute distances between phased light 
curves for periodic variable stars to detect anomalies (Protopapas et al. 2006; Rebbapragada 
et al. 2009), we compute a distance measure between feature vectors. 

Similar to Bhattacharyya et al. (2011), we use a semi-supervised approach to compute
the anomaly score for each variable star. We begin by fitting a random forest classifier to the
training set as in §3.2. The random forest outputs a proximity measure $\rho_{ij}$ between each
pair of sources $i$ and $j$, which gives the proportion of trees in the random forest for which
the feature vectors $x_i$ and $x_j$ appear in the same terminal node. If two sources have similar
feature vectors with respect to the topology of the random forest, then the proximity will be
near 1, whereas if the feature vectors are dissimilar then the proximity will be near 0. Using
the proximity measure, we define the discrepancy between the two feature vectors $x_i$ and $x_j$
as

$$ d(x_i, x_j) = \frac{1 - \rho_{ij}}{\rho_{ij}} \tag{7} $$

which takes on non-negative real values. This metric is semi-supervised because it
uses the labeled training set to construct the optimal random forest classifier, which is then
used to compute proximities (and discrepancies) between labeled and unlabeled sources.

Fig. 5. Reliability diagram for ASAS training data. The more closely the curve follows
the diagonal, the better calibrated the classifier probabilities are. The initial random
forest probabilities (solid black line) are not well calibrated, as the RF probabilities tend to
grossly underestimate the true posterior probabilities for large estimated probabilities. Using
the calibration procedure of Bostrom (2008) results in well-calibrated adjusted probabilities
(dashed blue line) as they are consistent with the diagonal of the reliability diagram. In the
final ASAS catalog, we use this calibration procedure to adjust all of the posterior probability
estimates.

Fig. 6. Reliability diagrams for each of four subclasses in the ASAS training data. Within
each subclass, the calibration procedure (dashed blue lines) produces better calibration than
the raw, uncalibrated random forest estimates (solid black lines). Whereas the off-the-shelf
random forest probabilities are systematically too conservative for large estimated
probabilities within each of the four subclasses, the adjusted probabilities are more consistent with
the diagonal for most probability bins.

The novelty of the distance measure in Equation 7 is that it automatically gives more 
weight to features which are important in the classifier while ignoring useless features. For 
instance, if a feature is important for classification, then the RF trees will make many splits 
on that feature, thus dividing the coordinate into many sub-regions. Hence, for a new source, 
the value of that class-predictive feature will have a great deal of power in determining which 
terminal node the source falls into for each tree, and thus will be a strong determinant of 
its proximity to other sources. Likewise, features that are unimportant for classification will 
never be split on by any tree, and thus proximities will be unaffected by their values. Unlike 
Euclidean distance, the proximity-based distance measure adapts to the geometry of the 
classification problem and can treat different regions of feature space differently based on 
the class boundaries and prevalence of training data in those regions. 

Using the RF proximity measure, we construct an anomaly score for each ASAS object. 
We first compute the distance, using Equation 7, from the feature vector of each ASAS 
source to the feature vector of every training source. We define the anomaly score for each 
ASAS object to be the distance (Equation 7) to the 2nd nearest neighbor in the training set. 
Objects with large anomaly scores should be considered as outliers and their classifications 
should not be trusted because there is too much discrepancy between the features of those 
sources and the set of training set variable stars. Note the subtle difference between the 
anomaly score and classification probability: sources with small maximal class probability 
may reside near training data but fall in regions of feature space shared by several science 
classes. At the same time, sources with high anomaly score may have large maximal class 
probability due to their relatively close proximity to the training objects of a certain class 
compared to the training objects of the other classes. 
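Given the random forest proximities between one ASAS source and every training source, the anomaly score can be sketched as below. The sketch assumes the discrepancy of Equation 7 takes the form (1 - ρ)/ρ, a non-negative quantity that is 0 at proximity 1 and grows without bound as the proximity approaches 0. Treating a proximity of exactly zero as an infinite distance is a convention chosen for this example.

```python
def discrepancy(proximity):
    """Map an RF proximity in [0, 1] to a non-negative distance (1 - rho) / rho."""
    if proximity <= 0.0:
        return float("inf")  # assumed convention for sources sharing no leaf
    return (1.0 - proximity) / proximity

def anomaly_score(proximities_to_training, k=2):
    """Distance to the k-th nearest training source (k = 2 in the text)."""
    distances = sorted(discrepancy(rho) for rho in proximities_to_training)
    return distances[k - 1]
```

Under this form, an anomaly score of 10 corresponds to a proximity of 1/11 to the second-nearest training source.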

The anomaly score provides a positive real-valued number for each object. However, 
we may ultimately want to make a decision, for each object, of whether or not that source 
is an outlier, by thresholding on the anomaly score. To determine an appropriate score 
threshold for anomaly detection, we employ cross-validation on the training set. In each 
of K = 10 cross-validation folds, we hold out a random subset of the ASAS training data,
fit the random forest classifier on the remaining data, and compute the anomaly score for
each held-out object. Then, for each anomaly score threshold, we record the cross-validated 
classification error rate over the ASAS training data, counting each object whose anomaly 
score surpasses the threshold as an error. Results of this experiment are in Figure 7. As 
the threshold decreases, we identify more objects as outliers, but the classification error rate 
only becomes significantly affected for thresholds smaller than 10.0. Following the 1-σ rule of
Hastie et al. (2009) over 10 repetitions of the procedure, we find that the optimal threshold 
level is t* = 10.0. Therefore, we recommend that the 1741 ASAS objects with anomaly score 
larger than 10.0 be treated as outliers. 
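
The threshold selection can be sketched as follows, under simplifying assumptions: the
held-out anomaly scores and misclassification flags from all folds are pooled into two lists,
and the standard deviation of the baseline error (which the text estimates from repeated
cross-validation) is passed in as a number. All names and the toy data are hypothetical.

```python
def error_with_threshold(scores, misclassified, t):
    """Error rate when every object with anomaly score above t counts as an error."""
    errors = sum(1 for s, m in zip(scores, misclassified) if m or s > t)
    return errors / len(scores)


def choose_threshold(scores, misclassified, candidates, baseline_sd):
    """Smallest candidate threshold whose error stays within one sd of the baseline."""
    baseline = sum(misclassified) / len(misclassified)  # error with no thresholding
    acceptable = [
        t for t in candidates
        if error_with_threshold(scores, misclassified, t) <= baseline + baseline_sd
    ]
    return min(acceptable) if acceptable else None


# Toy held-out results (illustrative values only).
toy_scores = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 0.7, 3.0, 20.0, 1.5]
toy_wrong = [False, False, False, True, False, True, False, False, True, False]
best = choose_threshold(toy_scores, toy_wrong, [1.0, 2.0, 5.0, 10.0, 20.0], 0.05)
print(best)
```

Because lowering the threshold can only add errors, the error rate is monotone in t, and the
rule reduces to finding where the curve first leaves the one-standard-deviation band.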

In Figure 8 we plot the ASAS light curves of eight sources that are amongst the highest 
outlier scores. These objects include a light curve with only 3 epochs of data, two other 
light curves with fewer than 15 epochs of data, a known emission-line star showing
variability on long time scales (ASAS061940+1822.3), a likely Be star showing semi-regular 
pulsations with amplitude modulation (ASAS073246-1519.3), and a red star with quasi- 
periodic low-amplitude variability on 18-day timescales (ASAS185203-2937.7). Additionally, 
two of these objects show semi-regular pulsations of smaller than 1 mag on 150-200-day 
timescales (ASAS175200-5751.9 and ASAS191550-0128.2); these are likely to be blended 
Mira variables. For each of these outliers, there are no training instances that capture the 
observed variability in their ASAS light curves. 



4. The Catalog 

Here we describe the contents of the publicly-available Machine-learned ASAS Classifi- 
cation Catalog. MACC is available for download at www.bigmacc.info. The first 40 rows 
of the classification catalog are reproduced here in Table 4. The columns of the catalog are 
as follows: 

• ASAS_ID - ID from the ASAS Catalog of Variable Stars

• dotAstro_ID - ID from the online database http://dotastro.org/

• RA, DEC - Coordinates from ASAS^^

• Class - Most probable class from the machine-learned classifier

• P_Class - Posterior probability that the source is from that class, given the ASAS light
curve and colors

• Anomaly - Metric from §3.4; objects with score greater than 10.0 should be considered
as outliers

• ACVS_Class - Classification from the ASAS Catalog of Variable Stars (Pojmański 2002)

• Train_Class - If the ASAS object was in the training set, its training class, else blank

• Mira, ..., W_Ursae_Maj - Posterior class probabilities for all 28 science classes

• P, P_signif - Best-fit period (in days) and its statistical significance (in number of σ)

• N_epochs - Number of epochs in the ASAS light curve used to classify the object

• V, deltaV - Mean ASAS V-band magnitude and ASAS V-band amplitude



^^Coordinates from ASAS are sometimes wrong by several arcsec due to its ~15-arcsec pixel size. This
effect is worse in crowded fields.

Fig. 7. — Determination of the optimal anomaly score threshold via cross-validation. As
the outlier threshold, t, is reduced, more objects are considered anomalies, and the cross-
validated error rate increases (outliers are, by construction, assigned no label, incurring a
classification penalty of 1). Using the 1-σ rule, which chooses the smallest threshold for which
the error rate is within one standard deviation of the default model with no thresholding,
we find that the optimal threshold on anomaly score is t* = 10.0. Adopting this threshold
for the ASAS data, we discover 1741 outliers.

Fig. 8. — Top outlier light curves as determined from the anomaly score. These light
curves are the furthest from their second nearest neighbor in the training set according to
the anomaly metric in Equation 7. These sources are outliers due to either too few data
(a,b,c), long-scale secular trends (d), quasi-periodic behavior of irregular type (e), occupying
an anomalous region of period-amplitude space (f), or high-amplitude variability suppressed
due to blending with nearby sources (g,h). For each of these sources, there are no training
instances that capture the observed variability in their ASAS light curves.

MACC has been constructed to allow easy querying of objects of a specified science class, 
simple searching for outliers, and more advanced queries on several attributes. In supplying
the posterior class probabilities for each class, the catalog allows each individual researcher 
to define their own probability threshold in querying objects. For instance, imagine that 
scientists A and B are both interested in finding Mira variables, but scientist A requires a 
highly pure sample while scientist B simply wants the top 3,000 Mira candidates, even if a 
substantial number of these are non-Miras. Then, scientist A could use a strict threshold, 
selecting all candidates with P(Mira) > 0.95 and Anomaly < 10 (resulting in 2,131 very 
likely Mira candidates), while scientist B would simply grab the 3,000 objects with largest 
P(Mira) (which, in this case is equivalent to a Mira probability threshold of 0.470). 
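
The two selection strategies can be sketched in a few lines of Python operating on catalog
rows held as dictionaries. The column names `Mira` and `Anomaly` follow the column list above;
the function names, the placeholder IDs, and the toy values are hypothetical and do not come
from MACC.

```python
def pure_sample(rows, cls, p_min=0.95, anomaly_max=10.0):
    """Scientist A: high-probability candidates that are not flagged as outliers."""
    return [r for r in rows if r[cls] > p_min and r["Anomaly"] < anomaly_max]


def top_candidates(rows, cls, n):
    """Scientist B: the n rows with the largest posterior probability for cls."""
    return sorted(rows, key=lambda r: r[cls], reverse=True)[:n]


# Toy rows with placeholder IDs (not real ASAS sources).
toy_rows = [
    {"ASAS_ID": "toy-1", "Mira": 0.99, "Anomaly": 0.4},
    {"ASAS_ID": "toy-2", "Mira": 0.97, "Anomaly": 14.2},
    {"ASAS_ID": "toy-3", "Mira": 0.60, "Anomaly": 1.1},
    {"ASAS_ID": "toy-4", "Mira": 0.05, "Anomaly": 0.9},
]
pure = pure_sample(toy_rows, "Mira")
top2 = top_candidates(toy_rows, "Mira", 2)
implied_threshold = top2[-1]["Mira"]  # probability of the last object kept
print([r["ASAS_ID"] for r in pure], implied_threshold)
```

The probability of the last object kept by `top_candidates` plays the role of the implied
threshold quoted above for the top 3,000 Mira candidates.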

In addition to a full catalog download, the online catalog at www.bigmacc.info provides
several ways for users to interact with the data. For users familiar with the SQL query 
language, the QUERY page enables users to specify SQL queries with conditions of their 
choice. A Basic query provides the simplest SQL interface for the data, while the Advanced 
query allows fine-grained control of the output. The BROWSE page allows users to explore 
the catalog by the most probable class determined by our classifier. A visualization of the 
tree of science classes enables a novel way of exploring the variable star taxonomy while 
querying and presenting the data in real-time. Examples of visualization of the data are 
given in the VISUALIZE section along with guides on how to create these and other plots 
without downloading the data. See Figure 9 for a screenshot of the website. 

In all the tables on www.bigmacc.info, clicking a table header sorts the data by that
column, and the columns can be customized in the Customize Table dialog. Certain
circumstances might require astronomers to query only a certain area on the sky, which is
supported by the Specify RA/Dec dialog, where the RA, Dec, and a search radius can be
specified. Clicking a single row shows the individual source page, which provides important
information about the source, including light curves, most probable member classes, and a
comments section to encourage feedback and collaboration within the variable star research
community. For offline access to query results, the Export Query Results button gives
the user a traditional CSV file for download.
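
For users working offline with the exported CSV, a position-constrained query of the kind
offered by the Specify RA/Dec dialog can be approximated locally. The sketch below is not the
website's implementation; it assumes that the RA and DEC columns have been converted to
decimal degrees, and the function names and toy rows are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(rows, ra0, dec0, radius_deg):
    """Rows whose position lies within radius_deg of (ra0, dec0)."""
    return [
        r for r in rows
        if angular_separation_deg(r["RA"], r["DEC"], ra0, dec0) <= radius_deg
    ]


# Toy positions in decimal degrees (placeholder IDs, not real sources).
toy_rows = [
    {"ASAS_ID": "toy-1", "RA": 10.0, "DEC": -5.0},
    {"ASAS_ID": "toy-2", "RA": 10.2, "DEC": -5.1},
    {"ASAS_ID": "toy-3", "RA": 50.0, "DEC": 20.0},
]
nearby = cone_search(toy_rows, 10.0, -5.0, 0.5)
print([r["ASAS_ID"] for r in nearby])
```

Given the roughly 15-arcsec ASAS pixel size noted earlier, search radii much smaller than that
are unlikely to be meaningful.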

The online catalog has been built on top of the Google Fusion Tables framework, which
allows quick and robust access to the data while minimizing the administration overhead
associated with hosting a catalog like this. The sky position of each source is geocoded and
indexed in the master fusion table to enable fast position-constrained queries. The front
page provides full access to the Fusion Tables interface, where queries and visualizations can
be customized to the user's needs. Furthermore, the site is built on the Google App Engine
for Python, a cloud-based solution providing free hosting of web applications below a
certain usage limit and the ability to scale on demand in case of temporarily or permanently
increased traffic. These services, along with powerful client-side JavaScript frameworks such
as jQuery and InfoVis Toolkit, allow rapid development of data-centric web applications with
minimal administrative constraints, which are all important aspects in the presentation of
a catalog like this.



4.1. Substituting Different Class Priors

All of the posterior class probabilities given in MACC assume that the prior probability
of observing an object of class c_j (before observing any data) is given by the proportion of
training set objects that are of class c_j (provided in Table 3). By Bayes' Rule, the posterior
MACC class probability for class c_j given the features, x_i, for object i, is

$$P(c_j \mid x_i) = \frac{P(x_i \mid c_j)\, P_{\rm tr}(c_j)}{\sum_{k=1}^{28} P(x_i \mid c_k)\, P_{\rm tr}(c_k)} \qquad (8)$$

where P_tr(c_j) is the prior class probability given by the proportion of objects of class c_j in
the training set. To substitute a different vector of prior class probabilities, one must multiply
each posterior probability from the catalog by the ratio of the new prior to the training set
prior and multiply by the corresponding ratio of denominators from Equation 8. For a new
prior P_new(c_j), the new posterior probabilities are given by

$$P_{\rm new}(c_j \mid x_i) = P(c_j \mid x_i)\, \frac{P_{\rm new}(c_j)}{P_{\rm tr}(c_j)}\, \frac{\sum_{k=1}^{28} P(x_i \mid c_k)\, P_{\rm tr}(c_k)}{\sum_{k=1}^{28} P(x_i \mid c_k)\, P_{\rm new}(c_k)} \qquad (9)$$

Fig. 9. — A screenshot of the BROWSE section of the http://www.bigmacc.info website,
the online version of the Machine-learned ASAS Classification Catalog. Here, we show a
subsection of the classification taxonomy where the user has selected the Eruptive branch,
which is also reflected in the table below and the drop-down in the top-left corner. The
embedded figure shows the information page for an individual source, which provides light
curves, class probabilities and comments, along with other meta-data.

For modified priors, P_new, that are not too dissimilar from P_tr, the last term in Equation 9
will typically be near unity, and thus the modified posterior probabilities can be computed by
multiplying the original posteriors by the prior ratio and appropriately re-normalizing. For
very dissimilar priors, accurate estimates of all of the class-wise densities, P(x_i|c_k), would
have to be computed and stored on a fine grid of the 71-dimensional feature space, which is
both statistically and computationally infeasible.^'^ In the absence of satisfactory estimates
of the class-wise densities, and short of re-training the random forest classifier with different
prior weights, it is reasonable to update the posteriors by assuming that the last term in
Equation 9 is unity.
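
The update described above (multiply each catalog posterior by the prior ratio, then
re-normalize) can be sketched in Python as follows. The function name is hypothetical, and
the three-class posteriors and priors are toy numbers chosen for illustration; they are not
the Table 3 proportions.

```python
def reweight_posteriors(posteriors, prior_train, prior_new):
    """Scale each posterior by prior_new/prior_train and re-normalize to sum to 1."""
    weighted = {
        c: p * prior_new[c] / prior_train[c] for c, p in posteriors.items()
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("new priors give zero weight to every plausible class")
    return {c: w / total for c, w in weighted.items()}


# Toy example with three classes (illustrative numbers only).
catalog_post = {"Mira": 0.5, "RR_Lyrae_FM": 0.3, "W_Ursae_Maj": 0.2}
prior_train = {"Mira": 0.50, "RR_Lyrae_FM": 0.25, "W_Ursae_Maj": 0.25}
prior_new = {"Mira": 0.25, "RR_Lyrae_FM": 0.50, "W_Ursae_Maj": 0.25}
updated = reweight_posteriors(catalog_post, prior_train, prior_new)
print(updated)
```

In this toy case, doubling the prior of one class and halving another is enough to change
which class is most probable, which is why the choice of prior deserves explicit attention.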

The construction in Equation 9 allows us to also condition on additional information
such as Galactic coordinates (l, b), median magnitude, and/or distance. For instance, if we
have a good theoretical understanding of the expected demographics of variable stars as a
function of position in the Galaxy, we can imbue that information into the prior probabilities.
In other words, before observing any data for a particular object, we can modify its prior
class probabilities solely based on its location in the Galaxy. This can be a very powerful tool,
e.g., for finding star-forming regions near the Galactic plane, where the relative abundance
of young stellar objects will be higher (and that information can be inserted into the class
prior).



4.2. Difficult Class Boundaries 

There are certain classes of variability that are difficult to separate based on photometric
information alone. For instance, W Ursae Majoris, Delta Scuti, and RR Lyrae, FO stars all
show variability on the same time scales with similar amplitudes. Other classes, such as
Weak-line T Tauri and RS CVn stars, exhibit variability from similar physical mechanisms
(in this case, rotation of chromospherically active stars), which may result in ambiguous
classification of sources of those classes based on light curve information alone. An advantage
of using ML classification is that, given enough training data, these methods can learn which
light curve features best separate sources of similar class and can determine optimal class
boundaries. In Figure 10, we plot the most informative features for separating notoriously
difficult-to-separate classes of variable star. Even with relatively few training instances, the
classifier effectively learns how to best distinguish, e.g., Delta Scuti and Beta Cephei.



^■^Consider the most naive density estimate, a histogram. Constructing a 71-dimensional histogram for each
class by binning each feature into 10 bins requires 28 × 10^71 values to be computed and stored. Statistically,
such a density estimate is unreliable, as the amount of training data is microscopic compared to the vast
feature space occupied by 71 dimensions, rendering any simple density estimate useless.

That said, there will always be borderline cases, for which, given their light curve and
color data, it is impossible to confidently place the objects into a class. This uncertainty is
reflected by low posterior class probabilities assigned by the classifier across all classes. In
Figure 11, we plot the ASAS light curves for a few of the least confidently classified (lowest
maximal posterior probability) sources in MACC. These sources typically have poor data
quality and/or fall in outlying regions of light-curve feature space, meaning that there is
not enough light curve information from these objects for the classifier to make a confident
statement about their class. For comparison, in Figure 12 we plot a few of the ASAS objects
whose light curves have low outlier score but whose highest posterior class probability is
smaller than 0.5. These light curves do not show atypical behavior, but tend to reside on
the boundary between classes. The objects ASAS 035050+0325.7, ASAS 114757-4118.8, and
ASAS 051601+2237.1 all reside on the W UMa-Delta Scuti locus, making them impossible to
classify with any degree of confidence. Likewise, ASAS 192648-1212.5, ASAS 191013-1233.4,
and ASAS 120839-6858.6 all reside near the boundary between the SARG A and B classes.
Also, both ASAS 003041-4416.4 and ASAS 210538+2005.0 are Cepheids with atypically
high amplitudes and short periods that place them near the dividing line between Classical
and Population II Cepheid stars.
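
The distinction between outlying sources and sources that sit on a class boundary can be
made explicit with a small triage rule that combines the two catalog quantities. The sketch
below uses the 0.5 probability level and the 10.0 anomaly threshold quoted in the text; the
function name and the three labels are hypothetical conveniences, not MACC columns.

```python
def triage(max_prob, anomaly, p_min=0.5, anomaly_max=10.0):
    """Label a source from its maximal posterior probability and anomaly score."""
    if anomaly > anomaly_max:
        return "outlier"      # far from all training data; class not trusted
    if max_prob < p_min:
        return "ambiguous"    # near training data but between classes
    return "confident"


# Toy (max_prob, anomaly) pairs, illustrative only.
examples = [(0.45, 2.0), (0.99, 15.0), (0.80, 1.0)]
labels = [triage(p, a) for p, a in examples]
print(labels)
```

Checking the anomaly score first reflects the earlier point that a large maximal class
probability does not by itself rule out an outlier.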



5. Comparison to Literature 

We conclude with a comparison of our Machine-learned ASAS Classification Catalog
with a set of papers that have performed classification for ASAS objects. We first analyze
the similarities and differences between our classification catalog and the popular ACVS
catalog. Subsequently, we take a closer look at a handful of papers that have attempted
to find, in the ASAS data, objects of specific sub-classes. Overall, we find a very high
classification agreement rate with these other works. For cases in which our classification
catalog disagrees with the class-specific papers, the differences can be attributed to poor
quality of the ASAS photometry and extra information that was unavailable to our classifier
such as proprietary follow-up data including spectra and high-quality multi-band light curves.



5.1. ASAS Catalog of Variable Stars (ACVS) 

As a part of the ASAS Catalog of Variable Stars, predicted classes are provided for a
fraction of the stars. As described in Pojmański (2002), ACVS obtains its classifications
using a neural-net-type algorithm trained on a set of visually labeled ASAS sources, confirmed
OGLE Cepheids (Udalski et al. 1999a, b), and OGLE Bulge variable stars (Wozniak et al.
2002). A filter is used to divide strictly periodic from less regular periodic sources. A neural
net is trained on the period, amplitude, Fourier coefficients (first 4 harmonics), J - H and
H - Ks colors, and IR fluxes to predict the classes of the strictly periodic sources. Many ACVS
objects either have multiple labels or are annotated as having low-confidence classifications,
but no posterior class probabilities are given in the catalog. For less regular periodic sources,
location in the J - H versus H - Ks plane is tested; if the object falls within an area of
late-type irregular or semi-regular stars, it is assigned the label MISC, else it is inspected
by eye. We find that 38,117 ACVS stars, representing 76% of the catalog, are either labeled
as MISC, assigned multiple labels, or have low class confidence. The remaining 24% of
stars have confident ACVS labels, and provide a set of classifications to compare against our
catalog.

Fig. 10. — The Random forest classifier automatically discovers class boundaries in the high-
dimensional feature space. For certain easily confused classes, we plot the projections, in
two-dimensional feature spaces, of training objects (points with solid outline) and MACC-
classified objects (small dots). Top Left: In the skew-first-harmonic-amplitude plane, W
Ursae Majoris, RR Lyrae FO, and Delta Scuti stars are well separated, but Delta Scuti
and Beta Cephei remain confused. Top Right: However, Delta Scuti and Beta Cephei are
separated by their J - H color. Bottom Left: SARG A and B subtypes split naturally in
the period-amplitude plane. Bottom Right: Beta Persei and Beta Lyrae binaries are largely
separable by two features, with a small amount of overlap.

Fig. 11. — ASAS light curves for the candidates with lowest classification probability across
all 28 classes. Several of these light curves suffer from lack of data (a), large temporal gaps
(e), or large amounts of noise caused either by blending with nearby stars (f,g) or relative
faintness (h). Others are outliers due to abnormal period-amplitude combinations (b), or
secular variability on several-year timescales (c,d). These objects, and others that obtain
low probabilities across all 28 science classes, require further study to ascertain their true
nature.

Fig. 12. — Light curves of ASAS objects whose anomaly scores are small even though their
maximal classification probabilities are smaller than 0.5. These light curves show behavior
not inconsistent with a particular class of variability but typically reside between classes.
The objects in (a,b,c) reside on the border between Delta Scuti, W Ursae Majoris, and RR
Lyrae, FO variable stars. Likewise, the stars in (d,f) lie on the boundary between SARG A
and B subtypes. Similarly, the objects in (g,h) are either Classical or Population II Cepheids.

In the top panel of Figure 13 we plot the class-wise correspondence between our
classifications and the ACVS classes. Overall, there is an 80.5% correspondence between our
catalog and ACVS, for the 12,007 sources that are labeled confidently (and not as MISC)
in ACVS. For each of the ACVS sub-classes, except Population II Cepheid and Multi-Mode
Cepheid, we agree on at least 60% of objects. The large disagreement with the Population II
and Multi-Mode Cepheids is consistent with the results of Schmidt et al. (2009), who find
extreme biases in Cepheid classifications for ACVS. Of 282 stars labeled as Cepheid by ACVS,
only 14 were found by Schmidt et al. (2009) to be likely Pop. II Cepheids, while all but ~60
suffered from uncertain period estimates, and ~50 were rejected as obvious non-Cepheids.
We also find that our classifications of First Overtone RR Lyrae, Delta Scuti, and W Ursae
Majoris show a significant amount of discrepancy with those of ACVS. In particular, our
classifier finds that ~22% of the stars that ACVS classifies as RRc or Delta Scuti are truly
W UMa eclipsing variables.

In the bottom panel of Figure 13, we plot the class-wise correspondence for all 25,158
ASAS sources with MACC outlier score smaller than 3.0. For these more confidently
classified objects, MACC has a closer correspondence with ACVS (91.6% for the 8,204 objects
with confident ACVS class), but still shows a high level of disagreement for the non-Classical
Cepheids. Of these sources, we find a 97% agreement on Miras, 86% on Classical Cepheids,
99% on RR Lyrae, FM, perfect agreement on 39 Chemically Peculiar stars, 98% on Beta
Persei, and 93% on W Ursae Majoris sources.
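
The correspondence rates quoted above amount to an agreement fraction computed over the
sources with a single confident ACVS class. A minimal sketch is given below. It assumes that
MACC and ACVS labels have already been mapped to a common naming scheme, that uncertain ACVS
labels carry a ':' flag, and (as an assumption about the ACVS label format) that multiple
classes are separated by '/'. The function names and toy labels are hypothetical.

```python
def is_confident_acvs(label):
    """True for a single ACVS class that is not MISC and carries no ':' flag."""
    return bool(label) and label != "MISC" and ":" not in label and "/" not in label


def agreement_rate(pairs):
    """Fraction of (macc_class, acvs_class) pairs that agree, over confident ACVS labels."""
    kept = [(m, a) for m, a in pairs if is_confident_acvs(a)]
    if not kept:
        return None
    return sum(1 for m, a in kept if m == a) / len(kept)


# Toy label pairs (illustrative only).
toy_pairs = [
    ("Mira", "Mira"),
    ("W_Ursae_Maj", "W_Ursae_Maj"),
    ("W_Ursae_Maj", "Delta_Scuti"),
    ("Mira", "MISC"),                           # excluded: MISC
    ("Beta_Persei", "Beta_Persei:"),            # excluded: uncertain
    ("RR_Lyrae_FM", "RR_Lyrae_FM/Delta_Scuti"), # excluded: multiple labels
]
rate = agreement_rate(toy_pairs)
print(rate)
```

Restricting `pairs` beforehand to sources with small anomaly score would give the analogue
of the bottom panel of Figure 13.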

5.1.1. Confident MACC Classifications missed by ACVS 

In addition to having > 80% correspondence with ACVS for objects which they confidently
label, MACC identifies many confidently classified sources (having posterior class
probability of at least 0.9 for a single class) whose ACVS classification is either
uncertain (denoted by a ':' in the catalog) or split between multiple classes. In all, MACC
identifies 187 Mira, 22 Classical Cepheid, 122 Fundamental Mode RR Lyrae, 11 First
Overtone RR Lyrae, 14 Beta Cephei, 43 Chemically Peculiar, 152 Beta Persei, 210 Beta Lyrae,
and 1548 W Ursae Majoris candidates that were not found by ACVS. Lowering the confidence
threshold from 0.9 to 0.8 yields about 50% more good candidates.

Fig. 13. — Top: Correspondence of the MACC to the ACVS classifications for all 50,124
sources. Rows are normalized to sum to 100%; marginal counts are listed to the right and
bottom of the table. There is an 80.5% total correspondence between our classifications
and the ACVS labels, for the 12,007 objects whose ACVS classification is a single confident
class not equal to MISC. Bottom: Same for the subset of 25,158 ASAS sources with outlier
score smaller than 3. The agreement rate between MACC and ACVS for the subset of these
sources with confident ACVS class (8,204 objects) is 91.6%.

In Figures 14-15 we plot, for 8 different science classes, the ASAS light curves of selected
Machine-learned ASAS Classification Catalog sources whose maximal class probabilities of
belonging to that class are greater than 0.9 but whose ACVS classification is different or
not confident. Within each of these classes, the light curves appear as expected for that class
of variability. MACC appears better able than ACVS to classify objects near the magnitude
limit of ASAS and objects whose light curves are of lower SNR.

5.2. Classical Cepheids: Berdnikov et al. (2011) 

Berdnikov et al. (2011) present multi-band light curves of 49 Classical Cepheid can- 
didates from the ACVS catalog, with data from the 76-cm telescope of the South African 
Astronomical Observatory and the 40-cm telescope of the Cerro Armazones Observatory 
of the Catholic University of the North, Chile. From these observations, they are able to 
confirm that 48 are Classical Cepheids and one, ASAS 100914-5714.6, is a Double-Mode 
Cepheid. Our classifier correctly identifies 46 of these 48 Classical Cepheids. See Table 5 
for a complete listing of our catalog classification, posterior probability of Classical Cepheid, 
ranking of Classical Cepheid probability out of all 50K ASAS sources, and anomaly score for 
all 49 objects observed by Berdnikov et al. (2011). None of these objects is in our MACC
training set. 

For two of these sources, ASAS 075750-2923.5 and ASAS 164120-4739.6, our catalog
identifies the objects as Multi-Mode Cepheids. For the former, which has a period of 2.586
days, there is significant scatter in the phased ASAS light curve and a relatively low
amplitude, making its ASAS light curve more consistent with a Multi-Mode Cepheid. It is
likely the presence of a bright neighbor to ASAS 075750-2923.5 that causes this scatter
and depressed amplitude. However, the light curve of Berdnikov et al. (2011), which only
contains 9 epochs of data, does not completely rule out a Multi-Mode pulsator. For the
latter, ASAS 164120-4739.6, the object is in the plane (Galactic latitude -0.842°) and has
a bright neighbor, which again causes a large amount of scatter and suppressed amplitude
in the ASAS light curve. Additionally, we find two outliers in the Berdnikov et al.
(2011) catalog, ASAS 073453-2651.3 and ASAS 100914-5714.6, whose Anomaly Scores are
greater than the outlier threshold of t* = 10.0 (§3.4). The object ASAS 100914-5714.6 is
the Double-Mode Cepheid confirmed by Berdnikov et al. (2011), and appears as an outlier
due to its anomalously large amplitude and long period for a Double-Mode Cepheid. The
other outlier, ASAS 073453-2651.3, has a median brightness fainter than 14th magnitude,
making it barely detectable in ASAS, which results in a noisy light curve and an underestimated
amplitude.

Fig. 14. — ASAS light curves for arbitrarily chosen candidates with high probability of being
(a) Mira, (b) Classical Cepheid, (c) Fundamental Mode RR Lyrae, and (d) First Overtone
RR Lyrae whose ACVS classification either includes multiple classes, is insecure or MISC,
or otherwise differs from that of MACC.

Fig. 15. — Same as Figure 14 for: (e) Chemically Peculiar, (f) Beta Persei, (g) Beta Lyrae,
and (h) W Ursae Majoris.

5.3. Beta Cephei: Pigulski (2005) 

In the work of Pigulski (2005), 14 new Beta Cephei stars appearing in ACVS were 
confirmed (in addition to 4 other previously known Beta Cephei stars). Starting with all 
37 stars whose ACVS classification includes BCEP as a possible class, the author makes 
selection cuts based on the ASAS periodogram and any available multi-band photometry 
and/or spectral type, finding 14 stars that the author deems as unambiguous. Then, with 
a broader set of 1700 ASAS stars, Pigulski detects 4 more bona fide candidates using the 
same selection criteria. 

In Table 6 we report our catalog's classification for each of the Pigulski (2005) Beta
Cephei stars. Note that all but one of these sources were included in the MACC training set.
We misidentify as a Delta Scuti star the one object (ASAS 161858-5103.5) that was not
included in the training set. This star is located directly in the Galactic plane, with a Galactic
latitude of -0.536°, and suffers from heavy extinction. Thus its observed colors are more
typical of the comparatively redder class of Delta Scuti stars than the bluer class of Beta
Cephei. With a Beta Cephei posterior class probability of 0.243, it ranks within the top 400
Beta Cephei candidates.

5.4. Double-Mode RR Lyrae: Szczygiel & Fabrycky (2007) 

Szczygiel & Fabrycky (2007) perform a search for multiple-pulsating RR Lyrae stars
in ASAS. Starting with all objects with an RR Lyrae classification in ACVS, this study first
culls obvious non-RR Lyrae stars via visual inspection. They pre-whiten each ASAS RR
Lyrae light curve at the pulsation period and run the CLEAN algorithm to find any significant
periodicity in the residual light curves. From this analysis, they identify on the order of 150
Blazhko-affected RR Lyrae and 19 Double-Mode RR Lyrae stars. The Double-Mode pulsators were
identified by making cuts on the pulsation period (P_0) and the ratio of the overtone to
fundamental periods (0.735 < P_1/P_0 < 0.755) and confirmed via visual inspection.
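
The period-ratio cut quoted above is simple to express in code. The sketch below implements
only the ratio condition 0.735 < P_1/P_0 < 0.755; the additional cut on P_0 and the visual
confirmation step are not reproduced because their details are not given here. The function
name and the example periods are hypothetical.

```python
def is_double_mode_candidate(p0, p1, lo=0.735, hi=0.755):
    """True if the overtone-to-fundamental period ratio p1/p0 lies strictly in (lo, hi)."""
    if p0 <= 0 or p1 <= 0:
        raise ValueError("periods must be positive")
    return lo < p1 / p0 < hi


# Illustrative periods in days (not measurements of real stars).
inside = is_double_mode_candidate(0.500, 0.372)   # ratio 0.744
outside = is_double_mode_candidate(0.500, 0.300)  # ratio 0.600
print(inside, outside)
```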
Table 5. Classification catalog results for Classical Cepheid stars confirmed by Berdnikov
et al. (2011).

ASAS ID         Predicted Class      P(Classical Cepheid)   Rank CCeph   Anomaly Score   In Training

052610+1151.3   Classical Cepheid    0.9995                 45           0.185           no
052706+1656.2   Classical Cepheid    0.9591                 179          1.165           no
062939-1840.5   Classical Cepheid    0.9544                 185          1.242           no
064037+1143.6   Classical Cepheid    0.98                   142          0.976           no
064829-1014.2   Classical Cepheid    0.9246                 217          1.488           no
070355-1752.8   Classical Cepheid    0.988                  126          0.859           no
071342-1737.2   Classical Cepheid    0.9284                 214          1.89            no
073113-2811.0   Classical Cepheid    0.9639                 170          1.525           no
073453-2651.3   Classical Cepheid    0.2799                 559          12.158          no
073502-3554.9   Classical Cepheid    0.915                  226          1.66            no
074925-3814.4   Classical Cepheid    0.9733                 155          0.866           no
075345-3658.2   Classical Cepheid    0.9997                 39           0.235           no
075358-2822.1   Classical Cepheid    0.8161                 277          1.525           no
075750-2923.5   MultiMode Cepheid    0.0899                 789          3.808           no
075840-3330.2   Classical Cepheid    0.9927                 109          0.582           no
075912-2641.9   Classical Cepheid    0.6498                 341          4.102           no
080500-2851.8   Classical Cepheid    0.99                   118          0.475           no
080511-3421.7   Classical Cepheid    0.9806                 140          0.73            no
080927-3315.7   Classical Cepheid    0.9674                 164          1.101           no
081025-3828.4   Classical Cepheid    0.9107                 227          1.825           no
081026-3231.3   Classical Cepheid    0.9796                 143          0.887           no
082117-3845.3   Classical Cepheid    0.9419                 203          0.961           no
082127-3825.3   Classical Cepheid    0.9608                 176          1.252           no
082859-3613.9   Classical Cepheid    0.997                  85           0.348           no
083130-4429.3   Classical Cepheid    0.5067                 419          5.173           no


083426-3559.1 


Classica 


C 


epheid 


0.9936 


106 


0.475 


no 


083611-3903.7 


Classica 


C 


epheid 


0.7172 


309 


1.924 


no 


084127-4353.6 


Classica 


C 


epheid 


0.8537 


258 


2.521 


no 


090436-4633.2 


Classica 


C 


epheid 


0.6235 


356 


7.621 


no 


090932-5359.3 


Classica 


C 


epheid 


0.9688 


162 


1.024 


no 


092758-5218.9 


Classica 


C 


epheid 


0.9592 


178 


1.083 


no 


093005-5137.5 


Classica 


C 


epheid 


0.9373 


209 


0.961 


no 


094819-5748.6 


Classica 


C 


epheid 


0.6993 


316 


3.167 


no 


094827-5801.1 


Classica 


C 


epheid 


0.9955 


96 


0.453 


no 


100914-5714.6 


Classica 


C 


epheid 


0.5148 


415 


10.905 


no 


101037-5817.8 


Classica 


C 


epheid 


0.797 


285 


2.521 


no 


101538-5933.1 


Classica 


C 


epheid 


0.4654 


442 


4.952 


no 


103627-6211.6 


Classica 


C 


epheid 


0.8333 


268 


1.488 


no 


112039-6149.9 


Classica 


C 


epheid 


0.8839 


239 


2.226 


no 


115701-6218.7 


Classica 


C 


epheid 


0.3507 


517 


9.638 


no 


122240-6209.5 


Classica 


C 


epheid 


0.858 


254 


1.874 


no 


123804-3831.4 


Classica 


C 


epheid 


0.9529 


189 


1.008 


no 


140742-6315.4 


Classica 


C 


epheid 


0.9268 


215 


1.66 


no 


150547-5823.0 


Classica 


C 


epheid 


0.9529 


190 


1.193 


no 


152021-5807.3 


Classica 


c 


epheid 


0.8242 


272 


3.63 


no 


164120-4739.6 


MultiMod 


c Cepheid 


0.2179 


576 


3.95 


no 


173253-3554.7 


Classica 


C 


epheid 


0.7444 


297 


4.435 


no 


174134-4854.6 


Classica 


C 


epheid 


0.5434 


398 


4.814 


no 


181416-0920.4 


Classica 


C 


epheid 


0.5961 


369 


4.263 


no 
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Table 6. Classification catalog results for Beta Cepliei stars in Pigulski (2005). 



ASAS ID 


Predicted Class 


P(Beta Cephei) 


Rank BetCep 


Anomaly Score 


In Training 


091731-5250.3 


Beta Cephei 


0.9767 


6 


2.497 


yes 


180233-4005.2 


Beta Cephei 


0.9942 


1 


0.923 


yes 


191715-f0103.6 


Beta Cephei 


0.9563 


13 


3.237 


yes 


212329-^0955.9 


Beta Cephei 


0.9547 


14 


5.329 


yes 


122213-6320.8 


Beta Cephei 


0.9781 


5 


2.03 


yes 


150955-6530.4 


Beta Cephei 


0.9864 


3 


2.106 


yes 


161858-5103.5 


Delta Scuti 


0.2428 


393 


3.237 


no 


164409-4719.1 


Beta Cephei 


0.9301 


19 


2.846 


yes 


164630-4701.2 


Beta Cephei 


0.9436 


17 


1.793 


yes 


164939-4431.7 


Beta Cephei 


0.9384 


18 


2.356 


yes 


165314-4345.0 


Beta Cephei 


0.9695 


9 


2.356 


yes 


165554-4808.8 


Beta Cephei 


0.9703 


8 


1.841 


yes 


171218-3306.1 


Beta Cephei 


0.9675 


11 


2.356 


yes 


180808-3434.5 


Beta Cephei 


0.9687 


10 


2.497 


yes 


181716-1527.1 


Beta Cephei 


0.9807 


4 


2.497 


yes 


182610-1704.3 


Beta Cephei 


0.9729 


7 


1.551 


yes 


182617-1515.7 


Beta Cephei 


0.9541 


15 


5.25 


yes 


182726-1442.1 


Beta Cephei 


0.9526 


16 


1.841 


yes 
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The MACC classification, posterior probability of Double-Mode RR Lyrae, ranking of 
RRd amongst all ASAS sources, and anomaly score for the 19 confirmed RRd from Szczygiel 
& Fabrycky (2007) are in Table 7. MACC correctly classifies all 19 stars even though only 
two of them were in our training set. Each of the stars has posterior probability of being a 
Double-Mode RR Lyrae of > 0.45 and each ranks within the top 41 RRd candidates. 
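
The period-ratio cut that Szczygiel & Fabrycky (2007) apply to select Double-Mode pulsators can be sketched in a few lines of Python. This is only an illustration of the ratio test quoted above (0.735 < P_1/P_0 < 0.755). The function name and the example periods are hypothetical, and the separate cut on the pulsation period itself is omitted because its value is not given in the text.

```python
def is_double_mode_candidate(p0_days, p1_days, ratio_min=0.735, ratio_max=0.755):
    """Apply the overtone-to-fundamental period-ratio cut.

    p0_days: fundamental-mode period; p1_days: overtone period.
    The default bounds are the ones quoted in the text.
    """
    if p0_days <= 0 or p1_days <= 0:
        raise ValueError("periods must be positive")
    return ratio_min < p1_days / p0_days < ratio_max


# Hypothetical periods (in days), chosen only to exercise the cut.
inside_window = is_double_mode_candidate(0.50, 0.3725)   # ratio 0.745
outside_window = is_double_mode_candidate(0.50, 0.40)    # ratio 0.800
print(inside_window, outside_window)
```

In the original study the cut is followed by visual inspection, so a test of this kind only narrows the candidate list and does not confirm a double-mode pulsator on its own.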

5.5. Orion Belt Objects: Caballero et al. (2010) 

In a search for high-amplitude variable stars in the Orion Belt, Caballero et al. (2010) 
identify 32 variable stars from ASAS photometry, proper motions, and infrared photometry 
(2MASS and the Infrared Astronomical Satellite (IRAS)). They perform an extensive litera- 
ture search on these objects and visual analysis to determine a likely classification for each. 
Of these 32 variable stars, 13 are in our catalog, and their classifications are listed in Table 
8. Our classifications agree with those of Caballero et al. (2010) for 9 of the 13 objects. 

For four objects, we disagree with the classifications of Caballero et al. (2010). The 
star ASAS 053621-0210.9 (PQ Ori) was found by us to be a semi-detached (Beta Lyrae) 
eclipsing system, while Caballero et al. (2010) note that although it has been identified as 
a possible young stellar object in the literature, its colors are too blue and it is more likely 
a field star. The star ASAS 053946-0055.9 was identified by us as either an LSP or RS CVn, 
consistent with the classification of Schirmer et al. (2009), while Caballero et al. (2010) 
retain it as an uncertain T Tauri candidate. The star ASAS 053543-0034.6 is claimed by 
Caballero et al. (2010) to have signs of youth; however, we find significant periodicity on 
86.61-day time scales, which is consistent with the pulsations of an RV Tauri star. Finally, 
ASAS 053642+0038.5 is identified by our catalog as a likely W Ursae Majoris candidate due 
to its telltale eclipsing structure on 1.06-day time scales; Caballero et al. (2010) claim that 
it is a possible HAeBe star, though they note that it has anomalous brightness. 

Table 7. Classification catalog results for double-mode RR Lyrae stars in Szczygiel & Fabrycky (2007). 

ASAS ID        Predicted Class  P(RRd)       Rank (RRd)   Anomaly Score  In Training
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  41           [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
[illegible]    RR Lyrae DM      [illegible]  [illegible]  [illegible]    no
151735-0105.3  RR Lyrae DM      0.6488       22           3.31           no
173726+1122.4  RR Lyrae DM      0.7197       13           2.571          no
183952-3200.9  RR Lyrae DM      0.9174       3            6.463          yes
184035-5350.7  RR Lyrae DM      0.9682       2            0.953          no
193933-6528.9  RR Lyrae DM      0.8154       6            2.145          no
195612-5043.7  RR Lyrae DM      0.9872       1            1.11           yes
210726+0110.3  RR Lyrae DM      0.6898       17           2.65           no
211848-3430.4  RR Lyrae DM      0.7864       10           2.676          no
212721-1908.0  RR Lyrae DM      0.7902       9            1.841          no
213437-4907.5  RR Lyrae DM      0.6397       23           3.673          no
230449-3345.3  RR Lyrae DM      0.7141       14           3.132          no
235622-5329.4  RR Lyrae DM      0.6855       18           2.571          no

Table 8. Classification catalog results for Orion Belt variables in Caballero et al. (2010). 

ASAS ID        Predicted Class    P(Class)  Anomaly Score  In Training  Caballero Class
054354-0243.6  W Ursae Maj        0.9993    0.33           no           Contact binary
053848-0227.2  Weak-line T Tauri  0.3377    6.463          no           T Tauri
053621-0210.9  Beta Lyrae         0.8389    3.274          no           HAeBe
053739-0146.3  Mira               0.999     0.163          no           Giant
053757-0140.8  Semireg PV         0.8433    2.086          no           Giant
053126-0058.6  W Ursae Maj        0.7972    2.145          no           Unknown
053946-0055.9  LSP                0.326     10.111         no           T Tauri?
052725-0035.2  SARG B             0.5592    2.289          no           Giant
053543-0034.6  RV Tauri           0.478     4.495          no           T Tauri
052634-0019.5  SARG B             0.8339    3.032          no           Giant
054612+0032.4  Classical T Tauri  0.3584    4.882          no           Unknown
053642+0038.5  W Ursae Maj        0.47      6.353          no           HAeBe?
053348+0055.6  SARG B             0.5954    3.167          no           Giant

6. Conclusions 

We have presented an end-to-end methodology for creating a probabilistic classification 
catalog for a time-domain survey of variability. With growing data volumes and rates, 
these types of automated classification catalogs become necessary for astronomers to make 
sense of such a vast amount of data and to optimize the allocation of limited follow-up 
resources. Though machine-learned construction of accurate classification catalogs is certainly 
a difficult undertaking, we have shown that sub-20% error rates are achievable even with as 
many as 28 classes and sub-classes of stellar variability. Furthermore, we have motivated the 
importance of disseminating probabilistic classifications with full disclosure of class priors, 
allowing each user freedom to trade class purity for efficiency and to use full probability 
vectors in performing astrophysical inference (for a recent use of probabilities for cosmological 
parameter estimation, see Newling et al. 2012). Additionally, it is crucial that the 
classification probabilities be calibrated so that the natural interpretation of probability holds, 
allowing for faithful propagation of that information to downstream analyses. 
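
As an illustration of how a catalog user might trade purity for efficiency, the Python sketch below selects sources above a probability threshold and reports both quantities. It assumes true class memberships are known for a validation sample. The function name and the toy probabilities and labels are hypothetical and are not MACC values.

```python
def purity_efficiency(probs, is_member, threshold):
    """Return (purity, efficiency) for sources with probability >= threshold.

    purity     = true members selected / all sources selected
    efficiency = true members selected / all true members
    """
    selected = [member for p, member in zip(probs, is_member) if p >= threshold]
    true_selected = sum(selected)
    total_true = sum(is_member)
    purity = true_selected / len(selected) if selected else float("nan")
    efficiency = true_selected / total_true if total_true else float("nan")
    return purity, efficiency


# Hypothetical validation sample: class probability and true membership.
probs = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
is_member = [True, True, False, True, False, False]

for threshold in (0.8, 0.5):
    purity, efficiency = purity_efficiency(probs, is_member, threshold)
    print(threshold, round(purity, 3), round(efficiency, 3))
```

In this toy sample, raising the threshold from 0.5 to 0.8 increases purity from 0.75 to 1.0 while efficiency falls from 1.0 to 2/3, which is the kind of trade-off a user can tune to the available follow-up resources.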

As a test case for the methodologies presented in this paper, and those adopted from 
Richards et al. (2011) and Richards et al. (2012), we build and make publicly available a 
28-class Machine-learned ASAS Classification Catalog of 50,124 sources that are included 
in the ASAS Catalog of Variable Stars. We show that accurate classifications are possible 
for such a complex, noisy, and diverse data set of photometric light curves. Furthermore, we 
demonstrate that calibrated probabilities are attainable using straightforward methodology 
and that semi-supervised anomaly detection can discover interesting objects that do not fit 
within a predefined classification taxonomy. Comparisons of our MACC with existing ASAS 
classifications, including those in ACVS, are favorable, and we eagerly await more intense 
scrutiny of the publicly available MACC from the astronomical community. Inevitably, many 
of our top classifications will be proven incorrect, but that is expected by the very nature 
of the product: it is, instead, the aggregate accuracy of our probabilistic 
classifications that is of most interest in the long term. 
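
One generic way to test aggregate accuracy is to bin sources by predicted probability and compare the mean prediction in each bin with the fraction of classifications later confirmed. The Python sketch below shows such a reliability check. It is not necessarily the procedure used for MACC, and the function name and toy data are hypothetical.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Bin predictions into equal-width probability bins on [0, 1].

    Returns one (mean predicted probability, observed fraction correct,
    count) tuple per non-empty bin, in order of increasing probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, correct in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, correct))
    table = []
    for members in bins:
        if members:
            count = len(members)
            mean_p = sum(p for p, _ in members) / count
            fraction = sum(1 for _, correct in members if correct) / count
            table.append((mean_p, fraction, count))
    return table


# Hypothetical follow-up results: 10 sources at P = 0.1 (1 confirmed)
# and 10 sources at P = 0.9 (9 confirmed).
probs = [0.1] * 10 + [0.9] * 10
outcomes = [True] + [False] * 9 + [True] * 9 + [False]

rows = reliability_table(probs, outcomes)
for mean_p, fraction, count in rows:
    print(round(mean_p, 3), round(fraction, 3), count)
```

For well-calibrated probabilities the first two columns agree to within sampling noise in every bin, so individual wrong classifications among high-probability sources are expected and do not by themselves indicate a faulty catalog.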

Some degree of the predicted accuracy and functionality of MACC has 
already been demonstrated in the concurrently submitted paper of Miller et al. (2012). In 
that paper, MACC was used to search for previously unknown R Coronae Borealis (RCB) and 
DY Persei (DYPer) stars in ASAS. Their search through the top MACC RCB candidates yielded 
12 likely RCB/DYPer stars, of which they confirmed, with new and archival spectroscopic 
observations, four RCB stars and four DYPers, increasing the number of 
known Galactic DYPers from 2 to 6. Miller et al. (2012) demonstrate that MACC 
recovers ASAS candidates that would have been missed by the typical search method, which 
uses hard cuts on the amplitude and periodicity of the light curves, and that a prohibitive 
number of objects would have to be manually searched via those traditional methods to 
recover all of the newly discovered objects. This is powerful validation that ML probabilistic 
classification can facilitate astronomical discovery and enable scientific results. 

Moving forward, there remain many pending tasks for our machine-learned approach to 
classification catalogs. First, we have not touched on the question of discovery of variability, 
only on classification once variable objects have been identified. Recently, Shin et al. (2009) 
have introduced a machine-learning approach to variability selection, which we will expand 
to develop new procedures. Second, the size and scope of MACC, at 50k variable stars 
at a brightness level reaching 14th magnitude, is rather small and limited. Tackling larger 
catalogs with millions of sources will test the feasibility of our algorithms and the robustness 
of our statistical approaches. Third, the future of time-domain surveys is multi-band light 
curves (e.g., DES, LSST). Neglecting the full use and exploitation of multi-band photometry 
would mean throwing away much useful information. Last, a large component of the 
catalog-building techniques that we have presented is the constant feedback from the automated 
classifier and the astronomical community. From compiling large and representative training 
sets to inventing new features that probe different types of variability, constant injection of 
more information into the machine learner is essential to optimize the accuracy, information 
gain, and ultimately the scientific impact of the catalog. 

The authors acknowledge the generous support of a CDI grant (#0941742) from the 
National Science Foundation. This work was performed in the CDI-sponsored Center for 
Time Domain Informatics (http://cftd.info). We would also like to thank IBM and 
CITRIS for providing the 280-core cluster at Berkeley, which was used to perform feature 
computations and classifier evaluations. 